Abstract

The performance of spectral clustering can be considerably improved
via regularization, as demonstrated empirically in Amini et al. [2]. Here,
we attempt to quantify this improvement through theoretical
analysis. Under the stochastic block model (SBM) and its extensions,
previous results on spectral clustering relied on the minimum
degree of the graph being sufficiently large. By examining the scenario
where the regularization parameter r is large, we
show that the minimum degree assumption can potentially be removed.
As a special case, for an SBM with two blocks, the results require the
maximum degree, as opposed to the minimum degree, to be large (grow
faster than log n). More importantly, we show the usefulness of regularization
in situations where not all nodes belong to well-defined clusters.
Our results rely on a ‘bias-variance’-like trade-off that arises from understanding
the concentration of the sample Laplacian and the eigen gap as a
function of the regularization parameter. As a byproduct of our bounds,
we propose a data-driven technique, DKest (standing for estimated Davis-Kahan
bounds), for choosing the regularization parameter. This technique
is shown to work well through simulations and on a real data set.


1  Introduction 

The  problem  of  identifying  communities  (or  clusters)  in  large  networks  is  an 
important  contemporary  problem  in  statistics.  Spectral  clustering  is  one  of  the 
more  popular  techniques  for  such  a  purpose,  chiefly  due  to  its  computational 
advantage  and  generality  of  application.  The  algorithm’s  generality  arises  from 
the  fact  that  it  is  not  tied  to  any  modeling  assumptions  on  the  data,  but  is 
rooted  in  intuitive  measures  of  community  structure  such  as  sparsest  cut  based 
measures [11], [24], [16], [20]. Other examples of applications of spectral
clustering include manifold learning [4], image segmentation [24], and text mining
[9].

The canonical nature of spectral clustering also generates interest in variants
of the technique. Here, we attempt to better understand the impact of
regularized forms of spectral clustering for community detection in networks.
In particular, we focus on the regularized spectral clustering (RSC) procedure
proposed in Amini et al. [2]. Their empirical findings demonstrate that the
performance  of  the  RSC  algorithm,  in  terms  of  obtaining  the  correct  clusters, 
is  significantly  better  for  certain  values  of  the  regularization  parameter.  An 
alternative  form  of  regularization  was  studied  in  Chaudhuri  et  al.  [7]  and  Qin 
and  Rohe  [22]. 

This paper attempts to provide a theoretical understanding of
the regularization in the RSC algorithm. We also propose a practical scheme
for  choosing  the  regularization  parameter  based  on  our  theoretical  results.  Our 
analysis  focuses  on  the  Stochastic  Block  Model  (SBM)  and  an  extension  of  this 
model.  Below  are  the  three  main  contributions  of  the  paper. 

(a) We attempt to understand regularization for the stochastic block model.
In particular, for a graph with n nodes, previous theoretical analyses of
spectral clustering under the SBM and its extensions [23], [7], [25], [10]
assumed that the minimum degree of the graph scales at least as a polynomial
power of log n. Even when this assumption is satisfied, the dependence on
the minimum degree is highly restrictive when it comes to making inferences
about cluster recovery. Our analysis provides cluster recovery results that
potentially do not depend on the above-mentioned constraint on the minimum
degree. As an example, for an SBM with two blocks (clusters), our
results require that the maximum degree be large (grow faster than log n)
rather than the minimum degree. This is done in Section 3.

(b) We demonstrate that regularization has the potential to address a situation
where the lower degree nodes do not belong to well-defined clusters.
Our results demonstrate that choosing a large regularization parameter has
the effect of removing these relatively lower degree nodes. Without regularization,
these nodes would interfere with the clustering of the remaining
nodes in the following way: In order for spectral clustering to work, the top
eigenvectors (that is, the eigenvectors corresponding to the largest eigenvalues
of the Laplacian) need to be able to discriminate between the clusters.
Due to the effect of nodes that do not belong to well-defined clusters, these
top eigenvectors do not necessarily discriminate between the clusters with
ordinary spectral clustering. This is done in Section 4.

(c) Although our theoretical results deal with the ‘large’ r case, it is observed
empirically that moderate values of r may produce better clustering performance.
Consequently, in Section 5 we propose DKest, a data-dependent
procedure for choosing the regularization parameter. We demonstrate that
this works well through simulations and on a real data set.

Our theoretical results involve understanding the trade-offs between the
eigen gap and the concentration of the sample Laplacian when viewed as a
function of the regularization parameter. Assuming that there are K clusters,
the eigen gap refers to the gap between the K-th smallest eigenvalue and the
remaining eigenvalues. An adequate gap ensures that the sample eigenvectors
can be estimated well ([26], [20], [16]), which leads to good cluster recovery.
The adequacy of an eigen gap for cluster recovery is in turn determined by the
concentration of the sample Laplacian.

In particular, a consequence of the Davis-Kahan theorem [5] is that if the
spectral norm of the difference of the sample and population Laplacians is small
compared to the eigen gap then the top K eigenvectors can be estimated well.
Denoting r as the regularization parameter, previous theoretical analyses of
regularization ([7], [23]) provided high-probability bounds on this spectral norm.
These bounds have a $1/\sqrt{r}$ dependence on r, for large r. In contrast, our
high-probability bounds behave like $1/r$, for large r. We also demonstrate that the
eigen gap behaves like $1/r$ for large r. The end result is that one
can get a good understanding of the impact of regularization by understanding
the situation where r goes to infinity. This also explains empirical observations
in [2], [22], where it was seen that the performance of regularized spectral clustering
does not change for r beyond a certain value. Our procedure for choosing
the regularization parameter works by providing estimates of the Davis-Kahan
bounds over a grid of values of r and then choosing the r that minimizes these
estimates.

The paper is organized as follows. In the next subsection we discuss preliminaries.
In particular, in Subsection 1.1 we review the RSC algorithm of [2],
and also discuss the other forms of regularization in the literature. In Section 2 we
review the stochastic block model. Our theoretical results, described in (a) and
(b) above, are provided in Sections 3 and 4. Section 5 describes our DKest
data-dependent method for choosing the regularization parameter.


1.1  Regularized  spectral  clustering 


In  this  section  we  review  the  regularized  spectral  clustering  (RSC)  algorithm  of 
Amini  et  al.  [2]. 

We first introduce some basic notation. A graph with n nodes and edge set
E is represented by the n × n symmetric adjacency matrix $A = ((A_{ij}))$, where
$A_{ij} = 1$ if there is an edge between i and j, otherwise $A_{ij}$ is 0. In other words,
for $1 \le i, j \le n$,

$$A_{ij} = \begin{cases} 1, & \text{if } (i,j) \in E \\ 0, & \text{otherwise.} \end{cases}$$


Given  such  a  graph,  the  typical  community  detection  problem  is  synonymous 
with  finding  a  partition  of  the  nodes.  A  good  partitioning  would  be  one  in  which 
there  are  fewer  edges  between  the  various  components  of  the  partition,  compared 
to  the  number  of  edges  within  the  components.  Various  measures  for  goodness 
of  a  partition  have  been  proposed,  chiefly  the  Ratio  Cut  [11]  and  Normalized 
Cut [24]. However, minimization of the above measures is an NP-hard problem
since it involves searching over all partitions of the nodes. The significance
of spectral clustering partly arises from the fact that it provides a continuous
approximation to the above discrete optimization problem [11], [24].

We now describe the RSC algorithm [2]. Denote by $D = \mathrm{diag}(d_1, \ldots, d_n)$
the diagonal matrix of degrees, where $d_i = \sum_{j=1}^{n} A_{ij}$. The normalized
(unregularized) symmetric graph Laplacian is defined as

$$L = D^{-1/2} A D^{-1/2}.$$

Regularization is introduced in the following way: Let J be a constant matrix
with all entries equal to 1/n. Then, in regularized spectral clustering one
constructs a new adjacency matrix by adding $rJ$ to the adjacency matrix A and
computing the corresponding Laplacian. In particular, let

$$A_r = A + rJ,$$

where $r \ge 0$ is the regularization parameter. The corresponding regularized
symmetric Laplacian is defined as

$$L_r = D_r^{-1/2} A_r D_r^{-1/2}. \tag{1}$$

Here, $D_r = \mathrm{diag}(d_{1,r}, \ldots, d_{n,r})$ is the diagonal matrix of ‘degrees’ of the
modified adjacency matrix $A_r$. In other words, $d_{i,r} = d_i + r$.
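
The construction in (1) can be sketched in a few lines. The following is a minimal illustration, assuming NumPy; the function name `regularized_laplacian` and the four-node path graph are hypothetical choices made only for this example.

```python
import numpy as np

def regularized_laplacian(A, r):
    """Return (L_r, d_r) with L_r = D_r^{-1/2} A_r D_r^{-1/2} and A_r = A + r*J."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    A_r = A + r / n                    # J has all entries 1/n, so r*J adds r/n everywhere
    d_r = A_r.sum(axis=1)              # regularized degrees: d_i + r
    scale = 1.0 / np.sqrt(d_r)         # requires d_i + r > 0 for every node
    L_r = scale[:, None] * A_r * scale[None, :]
    return L_r, d_r

# Hypothetical toy graph: a path on four nodes.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
L_r, d_r = regularized_laplacian(A, r=2.0)
print(d_r)                             # degrees of the path, each increased by r
```

Because every entry of $A_r$ is positive when r > 0, the vector $D_r^{1/2}\mathbf{1}$ is an eigenvector of $L_r$ with eigenvalue 1, which is also its largest eigenvalue.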

The RSC algorithm for finding K communities is described in Algorithm 1.
In order to bring to the forefront the dependence on r, we also denote the RSC
algorithm as RSC-r. The algorithm first computes $V_r$, the n × K eigenvector
matrix corresponding to the K largest eigenvalues of $L_r$. The columns of $V_r$ are
taken to be orthogonal. The rows of $V_r$, denoted by $V_{i,r}$, for $i = 1, \ldots, n$,
correspond to the nodes in the graph. Clustering the rows of $V_r$, for example using
the K-means algorithm, provides a clustering of the nodes. We remark that the
RSC-0 algorithm corresponds to the usual spectral clustering algorithm.


Algorithm 1 The RSC-r Algorithm [2]

Input: Laplacian matrix $L_r$.

Step 1: Compute the n × K eigenvector matrix $V_r$.

Step 2: Use the K-means algorithm to cluster the rows of $V_r$ into K clusters.
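
The two steps of Algorithm 1 can be sketched as follows. This is only an illustration, assuming NumPy: the function `rsc`, the deterministic farthest-point initialization of K-means, and the toy graph are choices made for this example and are not prescribed by [2].

```python
import numpy as np

def rsc(A, K, r, n_iter=100):
    """Sketch of RSC-r: top-K eigenvectors of L_r, then K-means on their rows."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    A_r = A + r / n
    scale = 1.0 / np.sqrt(A_r.sum(axis=1))
    L_r = scale[:, None] * A_r * scale[None, :]

    # Step 1: eigh returns eigenvalues in ascending order, so the last K
    # columns are the eigenvectors of the K largest eigenvalues.
    _, vecs = np.linalg.eigh(L_r)
    V = vecs[:, -K:]

    # Step 2: Lloyd iterations with a deterministic farthest-point start
    # (an illustrative choice; any K-means routine could be substituted).
    centers = [V[0]]
    for _ in range(1, K):
        d2 = np.min([((V - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(V[d2.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        d2 = ((V[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        updated = np.array([V[labels == k].mean(axis=0) if np.any(labels == k)
                            else centers[k] for k in range(K)])
        if np.allclose(updated, centers):
            break
        centers = updated
    return labels

# Hypothetical toy graph: two 5-node cliques joined by a single edge.
A = np.zeros((10, 10))
A[:5, :5] = 1
A[5:, 5:] = 1
np.fill_diagonal(A, 0)
A[4, 5] = A[5, 4] = 1
labels = rsc(A, K=2, r=1.0)
print(labels)
```

On this toy graph the second eigenvector changes sign between the two cliques, so the rows of $V_r$ separate cleanly and the two cliques are recovered as the two clusters.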


Our theoretical results assume that the data is randomly generated from a
stochastic block model (SBM), which we review in the next section. While it
is well known that there are real data examples where the SBM fails to provide
a good approximation, we believe that it provides a good playground for
understanding the role of regularization in the RSC algorithm. Recent works [2],
[10], [23], [6], [14] have used this model, and its variants, to provide
theoretical analyses for various community detection algorithms.

In Chaudhuri et al. [7], the following alternative regularized version of the
symmetric Laplacian is proposed:

$$L_{\deg,r} = D_r^{-1/2} A D_r^{-1/2}. \tag{2}$$

Here, the subscript deg stands for ‘degree’ since the usual Laplacian is modified
by adding r to the diagonal entries of the degree matrix D. Notice that for the RSC
algorithm the matrix A in the above expression was replaced by $A_r$.

As mentioned before, we attempt to understand regularization in the framework
of the SBM and its extension. We review the SBM in the next section.
Using recent results on the concentration of random graph Laplacians [21], we
show concentration results in Theorem 4 for the regularized Laplacian
in the RSC algorithm. Previous concentration results for the Laplacian (2),
as in [7], provide high-probability bounds on the spectral norm of the difference
of the sample and population regularized Laplacians that scale as
$1/\sqrt{r}$. However, for the regularization (1) we show that the dependence is
inverse in r, for large r. We believe that this holds for the regularization (2)
as well. We also demonstrate that the eigen gap depends inversely on r, for
large r. The benefit of this, along with our improved concentration bounds, is
that one can understand regularization by looking at the case where r is large.
This results in a very neat criterion for cluster recovery with the RSC-r
algorithm.


2  The  Stochastic  Block  Model 

Given a set of n nodes, the stochastic block model (SBM), introduced in [12],
is one among many random graph models that have communities inherent in their
definition. We denote the number of communities in the SBM by K. Throughout
this paper we assume that K is known. The communities, which represent
a partition of the n nodes, are assumed to be fixed beforehand. Denote these by
$C_1, \ldots, C_K$. Let $n_k$, for $k = 1, \ldots, K$, denote the number of nodes belonging
to each of the clusters.

Given the communities, the edges between nodes, say i and j, are chosen
independently with probability depending on the communities i and j belong to.
In particular, for a node i belonging to cluster $C_{k_1}$, and node j belonging to
cluster $C_{k_2}$, the probability of an edge between i and j is given by

$$P_{ij} = B_{k_1 k_2}.$$

Here, the block probability matrix

$$B = ((B_{k_1 k_2})), \quad \text{where } k_1, k_2 = 1, \ldots, K,$$

is a symmetric full rank matrix, with each entry in [0,1]. The n × n
edge probability matrix $P = ((P_{ij}))$, given by (3), represents the population
counterpart of the adjacency matrix A.

Denote $Z = ((Z_{ik}))$ as the n × K binary matrix providing the cluster memberships
of each node. In other words, each row of Z has exactly one 1, with
$Z_{ik} = 1$ if node i belongs to $C_k$. Notice that

$$P = ZBZ'. \tag{3}$$

Here $Z'$ denotes the transpose of Z. Consequently, from (3), it is seen that the
rank of P is also K.
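
As an illustration of (3), the sketch below builds $P = ZBZ'$ for a small two-block model and draws one adjacency matrix from it. It assumes NumPy; the helper names, the block sizes and the entries of B are hypothetical. Self-loops are excluded from the sample, which is an assumption of this sketch rather than something stated above.

```python
import numpy as np

def sbm_probability_matrix(sizes, B):
    """Return (P, Z) with P = Z B Z' for the given block sizes."""
    K = len(sizes)
    labels = np.repeat(np.arange(K), sizes)   # community index of each node
    Z = np.eye(K)[labels]                     # n x K membership matrix, one 1 per row
    return Z @ np.asarray(B, dtype=float) @ Z.T, Z

def sample_adjacency(P, rng):
    """Draw a symmetric 0/1 adjacency matrix with independent edges, no self-loops."""
    n = P.shape[0]
    upper = np.triu(rng.random((n, n)) < P, k=1)   # independent draws for i < j
    return (upper | upper.T).astype(int)

B = [[0.6, 0.1],
     [0.1, 0.4]]                              # hypothetical block probabilities
P, Z = sbm_probability_matrix([6, 4], B)
A = sample_adjacency(P, np.random.default_rng(0))
print(np.linalg.matrix_rank(P))               # rank of P equals K when B has full rank
```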

The population counterpart for the degree matrix D is denoted by
$\mathcal{D} = \mathrm{diag}(\bar{d}_1, \ldots, \bar{d}_n)$, where $\mathcal{D} = \mathrm{diag}(P\mathbf{1})$. Here $\mathbf{1}$ denotes the column vector of
all ones. Similarly, the population version of the symmetric Laplacian $L_r$ is
denoted by $\mathcal{L}_r$, where

$$\mathcal{L}_r = \mathcal{D}_r^{-1/2} P_r \mathcal{D}_r^{-1/2}.$$

Here $\mathcal{D}_r = \mathcal{D} + rI$ and $P_r = P + rJ$. The n × n matrices $\mathcal{D}_r$ and $P_r$ represent
the population counterparts to $D_r$ and $A_r$ respectively. Notice that since P has
rank K, the same holds for $\mathcal{L}_r$.

2.1  Notation 

We use $\|\cdot\|$ to denote the spectral norm of a matrix. Notice that for vectors
this corresponds to the usual $\ell_2$-norm. We use $A'$ to denote the transpose of a
matrix, or vector, $A$.

For positive $a_n$, $b_n$, we use the notation $a_n \asymp b_n$ if there exist universal
constants $c_1, c_2 > 0$ so that $c_1 a_n \le b_n \le c_2 a_n$. Further, we use $b_n \lesssim a_n$ if
$b_n \le c_2 a_n$, for some positive $c_2$ not depending on n. The notation $b_n \gtrsim a_n$ is
analogously defined.

The quantities

$$d_{\min,n} = \min_{i} \bar{d}_i, \qquad d_{\max,n} = \max_{i} \bar{d}_i,$$

where $\bar{d}_i = (P\mathbf{1})_i$ is the expected degree of node i,
denote the minimum and maximum expected degrees of the nodes.

2.2  The  Population  Cluster  Centers 

We now proceed to define population cluster centers $\mathrm{cent}_{k,r} \in \mathbb{R}^K$, for
$k = 1, \ldots, K$, for the K-block SBM. These points are defined so that the rows of the
eigenvector matrix $V_{i,r}$, for $i \in C_k$, are expected to be scattered around $\mathrm{cent}_{k,r}$.

Denote by $\mathcal{V}_r$ an n × K matrix containing the eigenvectors of the K largest
eigenvalues of the population Laplacian $\mathcal{L}_r$. As with $V_r$, the columns of $\mathcal{V}_r$ are
also assumed to be orthogonal.

Notice that both $\mathcal{V}_r$ and $-\mathcal{V}_r$ are eigenvector matrices corresponding to $\mathcal{L}_r$.
This ambiguity in the definition of $\mathcal{V}_r$ is further complicated if an eigenvalue of
$\mathcal{L}_r$ has multiplicity greater than one. We do away with this ambiguity in the
following way: Let $\mathcal{H}$ denote the set of all n × K eigenvector matrices of $\mathcal{L}_r$
corresponding to the top K eigenvalues. We take

$$\mathcal{V}_r = \arg\min_{H \in \mathcal{H}} \|V_r - H\|, \tag{4}$$

where recall that $\|\cdot\|$ denotes the spectral norm. The matrix $\mathcal{V}_r$, as defined
above, represents the population counterpart of the matrix $V_r$.

Let $\mathcal{V}_{i,r}$ denote the i-th row of $\mathcal{V}_r$. Notice that since the set $\mathcal{H}$ is closed under
the $\|\cdot\|$ norm, one has that $\mathcal{V}_r$ is also an eigenvector matrix of $\mathcal{L}_r$ corresponding
to the top K eigenvalues. Consequently, the rows $\mathcal{V}_{i,r}$ are the same across nodes
belonging to a particular cluster (see, for example, Rohe et al. [23] for a proof
of this fact). In other words, there are K distinct rows of $\mathcal{V}_r$, with each row
corresponding to nodes from one of the K clusters.

Notice that the matrix $\mathcal{V}_r$ depends on the sample eigenvector matrix $V_r$
through (4), and consequently is a random quantity. However, the following
lemma shows that the pairwise distances between the rows of $\mathcal{V}_r$ are
non-random and, more importantly, independent of r.

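Lemma 1. Let $i \in C_k$ and $i' \in C_{k'}$. Then,

$$\|\mathcal{V}_{i,r} - \mathcal{V}_{i',r}\| = \begin{cases} 0, & \text{if } k = k' \\ \sqrt{\dfrac{1}{n_k} + \dfrac{1}{n_{k'}}}, & \text{if } k \ne k'. \end{cases}$$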


From the above lemma, there are K distinct rows of $\mathcal{V}_r$ corresponding to
the K clusters. We denote these as $\mathrm{cent}_{1,r}, \ldots, \mathrm{cent}_{K,r}$. We also call these the
population cluster centers since, intuitively, in an idealized scenario the data
points $V_{i,r}$, with $i \in C_k$, should be concentrated around $\mathrm{cent}_{k,r}$.


2.3 Cluster recovery using K-means algorithm

Recall that the RSC-r Algorithm 1 works by performing K-means clustering on
the rows of the n × K sample eigenvector matrix, denoted by $V_{i,r}$, for $i = 1, \ldots, n$.
In this section, in particular Corollary 3, we relate the fraction of mis-clustered
nodes using the K-means algorithm to the various parameters in the SBM.

In general, the K-means algorithm can be described as follows: Assume
one wants to find K clusters, for a given set of data points $x_i \in \mathbb{R}^K$, for
$i = 1, \ldots, n$. Then the K clusters resulting from applying the K-means algorithm
correspond to a partition $T = \{T_1, \ldots, T_K\}$ of $\{1, \ldots, n\}$ that aims to minimize
the following objective function over all such partitions:

$$\mathrm{Obj}(T) = \sum_{k=1}^{K} \sum_{i \in T_k} \|x_i - \bar{x}_{T_k}\|^2. \tag{5}$$

Here $T = \{T_1, \ldots, T_K\}$ is a partition of $\{1, \ldots, n\}$, and $\bar{x}_{T_k}$ corresponds to the
vector of component-wise means of the $x_i$, for $i \in T_k$.
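
The objective (5) is straightforward to evaluate for a given partition. The snippet below is a small sketch, assuming NumPy; the function name `kmeans_objective` and the four data points are hypothetical.

```python
import numpy as np

def kmeans_objective(X, partition):
    """Evaluate Obj(T) in (5): within-cluster sum of squared distances to the mean."""
    X = np.asarray(X, dtype=float)
    total = 0.0
    for idx in partition:
        block = X[list(idx)]                        # points x_i with i in T_k
        total += ((block - block.mean(axis=0)) ** 2).sum()
    return total

# Hypothetical data: two tight pairs of points far apart from each other.
X = [[0, 0], [2, 0], [10, 0], [10, 2]]
good = kmeans_objective(X, [[0, 1], [2, 3]])        # pairs nearby points
bad = kmeans_objective(X, [[0, 2], [1, 3]])         # mixes the two groups
print(good, bad)
```

The partition that groups nearby points attains the smaller objective value, which is the behaviour K-means aims for.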

In our situation there is also an underlying true partition of nodes into clusters,
given by $C = \{C_1, \ldots, C_K\}$. Notice that $C = T$ iff there is a permutation
$\pi$ of $\{1, \ldots, K\}$ so that $C_k = T_{\pi(k)}$, for $k = 1, \ldots, K$. In general, we use the
following measure to quantify the closeness of the outputted partition T and
the true partition C: Denote the clustering error associated with $T_1, \ldots, T_K$ as

$$f = \min_{\pi} \max_{k} \frac{|C_k \cap T^c_{\pi(k)}| + |C_k^c \cap T_{\pi(k)}|}{n_k}. \tag{6}$$

The clustering error measures the maximum proportion of nodes in the symmetric
difference of $C_k$ and $T_{\pi(k)}$.
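
For small K, the error (6) can be computed by brute force over all permutations. The following is a minimal sketch using only the Python standard library; the function name and the example partitions are hypothetical, and the enumeration over K! permutations is only practical for small K.

```python
from itertools import permutations

def clustering_error(true_clusters, est_clusters):
    """Clustering error f in (6): min over permutations of the worst relative
    size of the symmetric difference between C_k and T_pi(k)."""
    C = [set(c) for c in true_clusters]
    T = [set(t) for t in est_clusters]
    K = len(C)
    best = float("inf")
    for perm in permutations(range(K)):
        # |C_k ∩ T^c| + |C_k^c ∩ T| is the size of the symmetric difference.
        worst = max(len(C[k] ^ T[perm[k]]) / len(C[k]) for k in range(K))
        best = min(best, worst)
    return best

# Hypothetical example: node 3 is placed in the wrong estimated cluster.
C = [{0, 1, 2, 3}, {4, 5, 6, 7}]
T = [{3, 4, 5, 6, 7}, {0, 1, 2}]
f = clustering_error(C, T)
print(f)
```

Here the best matching pairs $C_1$ with the second estimated cluster, and a single misplaced node out of four gives an error of 1/4 for both clusters.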

In many situations, such as ours, there exist population quantities associated
with each cluster around which the $x_i$'s are expected to concentrate.
Denote these quantities by $m_1, \ldots, m_K$. In our case, $m_k = \mathrm{cent}_{k,r}$. If the $x_i$'s,
for $i \in C_k$, concentrate well around $m_k$, and the $m_k$'s are sufficiently well separated,
then it is expected that the K-means algorithm recovers the clusters with
small error f.

Denote X as the n × K matrix with the $x_i$'s as rows. In our case, $x_i = V_{i,r}$,
and $X = V_r$. Further, denote as M the n × K matrix with the $m_k$'s as rows.
In our case, $M = \mathcal{V}_r$. Recent results on cluster recovery using the K-means
algorithm, as given in Kumar and Kannan [15] and Awasthi and Sheffet [3],
provide conditions on X and M for the success of K-means. The following
lemma is implied by Theorem 3.1 in Awasthi and Sheffet [3].

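Lemma 2. Let $\delta > 0$ be a small quantity. If for each $1 \le k \ne k' \le K$, one has

$$\|m_k - m_{k'}\| \ge \frac{1}{\delta}\,\sqrt{K}\,\|X - M\| \left(\frac{1}{\sqrt{n_k}} + \frac{1}{\sqrt{n_{k'}}}\right), \tag{7}$$

then the clustering error $f = O(\delta^2)$ using the K-means algorithm.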

Remark: In general, minimizing the objective function (5) is not computationally
feasible. However, the results in [15], [3] can be extended to partitions
T that approximately minimize (5). The condition (7), called the center separation
condition in [3], provides lower bounds on the pairwise distances between
the population cluster centers that depend on the perturbation of data points
around the population centers (represented by $\|X - M\|$) and the cluster sizes.

Let

$$1 = \mu_{1,r} \ge \mu_{2,r} \ge \cdots \ge \mu_{n,r}$$

be the eigenvalues of the regularized population Laplacian $\mathcal{L}_r$ arranged in
decreasing order. The fact that $\mu_{1,r}$ is 1 follows from standard results on the
spectrum of Laplacian matrices (see, for example, [26]). As mentioned in the
introduction, in order to control the perturbation of the first K eigenvectors the
eigen gap, given by $\mu_{K,r} - \mu_{K+1,r}$, must be adequately large, as noted in [26],
[20], [16]. Since $\mathcal{L}_r$ has rank K one has $\mu_{K+1,r} = 0$. Thus the eigen gap is
simply $\mu_{K,r}$. For our K-block SBM framework the following is an immediate
consequence of Lemma 2 and the Davis-Kahan theorem for the perturbation of
eigenvectors.

Corollary 3. Let $r \ge 0$ be fixed. For the RSC-r algorithm the clustering error,
given by (6), is

$$O\left(\frac{K\,\|L_r - \mathcal{L}_r\|^2}{\mu_{K,r}^2}\right).$$

Proof. Use Lemma 2 with $m_k = \mathrm{cent}_{k,r}$, $X = V_r$, $M = \mathcal{V}_r$, and notice
from Lemma 1 that $\|m_k - m_{k'}\|$ is $\sqrt{1/n_k + 1/n_{k'}}$.
Consequently, using $1/\sqrt{n_k} + 1/\sqrt{n_{k'}} \ge \sqrt{1/n_k + 1/n_{k'}}$ one gets from (7)
that if

$$\|V_r - \mathcal{V}_r\| \le \frac{\delta}{\sqrt{K}}, \tag{8}$$

for some $\delta > 0$, then at most an $O(\delta^2)$ fraction of nodes are misclassified with the
RSC-r algorithm.

From the Davis-Kahan theorem [5], one has

$$\|V_r - \mathcal{V}_r\| \le \frac{\|L_r - \mathcal{L}_r\|}{\mu_{K,r}}. \tag{9}$$

Consequently, if we take $\delta = (\sqrt{K}\,\|L_r - \mathcal{L}_r\|)/\mu_{K,r}$ then relation (8) is satisfied
using (9). This proves the corollary. □


3  Improvements  through  regularization 


In this section we will use Corollary 3 to quantify improvements in clustering
performance via regularization. If the number of clusters K is fixed (does not
grow with n) then the quantity

$$\frac{\|L_r - \mathcal{L}_r\|}{\mu_{K,r}} \tag{10}$$

in Corollary 3 provides an insight into the role of the regularization parameter
r. Clearly, an ideal choice of r would be the one that minimizes (10). Note,
however, that this is not practically possible since $\mathcal{L}_r$ and $\mu_{K,r}$ are not known in
advance.

Increasing r will ensure that the Laplacian $L_r$ will be well concentrated
around $\mathcal{L}_r$. This is demonstrated in Theorem 4 below. However, increasing r
also has the effect of decreasing the eigen gap, which in this case is $\mu_{K,r}$, since
the population Laplacian becomes more like a constant matrix upon increasing
r. Thus the optimum r results from the balancing out of these two competing
effects.
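
The two competing effects can be observed numerically. The sketch below, assuming NumPy, uses a hypothetical balanced two-block SBM (n = 200, within-block probability 0.10, between-block probability 0.02) and a hypothetical grid of r values. For this balanced model the population eigen gap has the closed form $\mu_{2,r} = \frac{n(p-q)/2}{n(p+q)/2 + r}$, which the code reproduces.

```python
import numpy as np

def laplacian(M, r):
    """Symmetric Laplacian of M + r*J, where J has all entries 1/n."""
    n = M.shape[0]
    M_r = M + r / n
    scale = 1.0 / np.sqrt(M_r.sum(axis=1))
    return scale[:, None] * M_r * scale[None, :]

# Hypothetical balanced two-block SBM.
n, p, q, K = 200, 0.10, 0.02, 2
labels = np.repeat([0, 1], n // 2)
P = np.where(labels[:, None] == labels[None, :], p, q)   # population matrix Z B Z'
rng = np.random.default_rng(1)
upper = np.triu(rng.random((n, n)) < P, k=1)
A = (upper | upper.T).astype(float)                       # one sampled graph

grid = [0.1, 1.0, 10.0, 100.0]
gaps, devs = [], []
for r in grid:
    L_pop = laplacian(P, r)
    gaps.append(np.linalg.eigvalsh(L_pop)[-K])            # eigen gap mu_{K,r}
    devs.append(np.linalg.norm(laplacian(A, r) - L_pop, 2))  # spectral norm deviation
    print(f"r={r:6.1f}  gap={gaps[-1]:.4f}  deviation={devs[-1]:.4f}  "
          f"ratio={devs[-1] / gaps[-1]:.3f}")
```

Both the deviation and the eigen gap shrink as r grows; the ratio of the two, which is the quantity (10), is what governs the clustering error bound.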

Independent of our work, a similar argument for the optimum choice of
regularization, using the Davis-Kahan theorem, was given in Qin and Rohe [22] for
the regularization proposed in [7]. However, they did not provide a quantification
of the benefit of regularization as given in this section and Section 4.

Theorem 4 provides high-probability bounds on the quantity $\|L_r - \mathcal{L}_r\|$
appearing in the numerator of (10). Previous analyses of the regularization (2),
in [7], [22], show high-probability bounds on the aforementioned spectral norm
that have a $1/\sqrt{d_{\min,n} + r}$ dependence on r. However, for large r, the theorem
below shows that the behavior is $\sqrt{d_{\max,n}}/(d_{\max,n} + r)$. We believe this holds
for the regularization (2) as well. Thus, our bounds have a $1/r$ dependence on
r, for large r, as opposed to the $1/\sqrt{r}$ dependence shown in [7]. This is crucial
since the eigen gap $\mu_{K,r}$ also behaves like $1/r$ for large r, which implies that
(10) converges to a limiting quantity as r tends to infinity. In Theorem 5 we provide a
bound on this quantity. Our claims regarding improvements via regularization
will then follow from comparing this bound with the bound on (10) at r = 0.

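Theorem 4. With probability at least $1 - 2/n$, for all r satisfying

$$\max\{r, d_{\min,n}\} \ge 32 \log n, \tag{11}$$

we have

$$\|L_r - \mathcal{L}_r\| \le \epsilon_{r,n}. \tag{12}$$

Here

$$\epsilon_{r,n} = \begin{cases} \dfrac{10\sqrt{\log n}}{\sqrt{d_{\min,n} + r}}, & \text{if } r \le 2 d_{\max,n} \\ \text{[expression garbled in source]}, & \text{if } r > 2 d_{\max,n}. \end{cases}$$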


We use Theorem 4, along with Corollary 3, to demonstrate improvements
from regularization over previous analyses of eigenvector perturbation. Our
strategy for this is as follows: Take

$$\delta_{r,n} = \frac{\epsilon_{r,n}}{\mu_{K,r}}.$$

Notice that from Corollary 3 and Theorem 4, one gets that with probability at
least $1 - 2/n$, for all r satisfying (11), the clustering error is $O(\delta_{r,n}^2)$.
Consequently, it is of interest to study the quantity $\delta_{r,n}$ as a function of r. Define

$$\delta_n = \lim_{r \to \infty} \delta_{r,n}. \tag{13}$$

Although we would have ideally liked to study the quantity

$$\tilde{\delta}_n = \min_{\max\{r,\, d_{\min,n}\} \gtrsim \log n} \delta_{r,n},$$

we study $\delta_n$ since it is easy to characterize, as we shall see in Theorem 5 below.
Section 5 introduces a data-driven methodology that is based on finding an
approximation for $\tilde{\delta}_n$.

Before introducing our main theorem quantifying the performance of RSC-r
for large r we introduce the following definition.

Definition 1. Let $\{r_n, n \ge 1\}$ be a sequence of regularization parameters.
For the K-block SBM we say that RSC-$r_n$ gives consistent cluster estimates if
the error (6) goes to 0, with probability tending to 1, as n goes to infinity.

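Throughout the remainder of the section we consider a K-block stochastic
block model with the following block probability matrix:

$$B = \begin{pmatrix} p_{1,n} & q_n & \cdots & q_n \\ q_n & p_{2,n} & \cdots & q_n \\ \vdots & \vdots & \ddots & \vdots \\ q_n & q_n & \cdots & p_{K,n} \end{pmatrix}. \tag{14}$$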

The number of communities K is assumed to be fixed. Without loss of generality, assume
that $p_{1,n} \ge p_{2,n} \ge \cdots \ge p_{K,n}$. We also assume that $q_n < p_{K,n}$. Denote $w_k = n_k/n$,
for $k = 1, \ldots, K$. The quantity $w_k$ represents the proportion of nodes belonging
to the k-th community. Throughout this section we assume that $\{r_n : n \ge 1\}$
is a sequence of regularization parameters satisfying

$$\frac{(1/w_k)\, d_{\max,n} \log n}{r_n} = o(1). \tag{15}$$

Notice that if the cluster sizes are of the same order, that is $w_k \asymp 1$, then the
above condition simply states that $r_n$ should grow faster than $d_{\max,n} \log n$.

Denote $\gamma_{k,n} = n_k (p_{k,n} - q_n)$. The following is our main result regarding the
impact of regularization.

Theorem  5.  For  the  K  block  SBM,  with  block  probability  matrix  (14), 


(tni,„mi,„  -  m2,n )  r-r 
On  ^  V  Qn 


logn. 


Here  Sn  is  given  by  (13)  and 


mi,n 


mi,n 


m,2,n 


E 


wk 

T  k,n 


E 


1 


E 


Wk 
7  l,n 


(16) 


(17) 

(18) 
(19) 


Further, let $\{r_n, n \ge 1\}$ satisfy (15). If $\delta_n$ goes to 0 as n tends to infinity, then
RSC-$r_n$ gives consistent cluster estimates.


Theorem 5 will be proved in Appendix B. In particular, the following corollary
shows that for the stochastic block model regularized spectral clustering
would work even when the minimum degree is of constant order. This is an
improvement over recent works on unregularized spectral clustering, such as [18],
[7], [23], which required the minimum degree to grow at least as fast as log n.


Corollary 6. Let the block probability matrix B be as in (14). Let $\{r_n, n \ge 1\}$
satisfy (15). Then RSC-$r_n$ gives consistent cluster estimates under the following
scenarios:

i) For the K-block SBM if $w_k \asymp 1$, for each $k = 1, \ldots, K$, and

$$\frac{(p_{K-1,n} - q_n)^2}{p_{1,n}} \ \text{grows faster than} \ \frac{\log n}{n}. \tag{20}$$

ii) For the 2-block SBM if $p_{2,n} = q_n$ and

$$\frac{(p_{1,n} - q_n)^2}{w_1 p_{1,n} + w_2 q_n} \ \text{grows faster than} \ \frac{\log n}{n\,(\min\{w_1, w_2\})^2}. \tag{21}$$

Remark: Regime i) deals with the situation where the cluster sizes are of
the same order of magnitude. Regime ii), where $p_{2,n} = q_n$, mimics a scenario
where there is only one cluster. This is a generalization of the planted clique
problem where $p_{1,n} = 1$ and $p_{2,n} = q_n = 1/2$. For the planted clique problem
(21) translates to requiring that $\min\{w_1, w_2\}$ grow faster than $\sqrt{\log n}/\sqrt{n}$ for
consistent cluster estimates, which is similar to results in [18].


[Figure 1, three panels: (a) Unregularized (r = 0); (b) Regularized (r = 26.5); (c) Regularized (r = n).]

Figure 1: Scatter plot of the first two eigenvectors, with the block probability
matrix B as in (22). The x and y axes give the values of the first and second
eigenvectors, respectively. The colors correspond to the cluster memberships of
the nodes. Plot a) corresponds to r = 0; b) to r = 26.5, selected using
our data-driven DKest methodology proposed in Section 5; c) to r = n.

Notice that in both (20) and (21) the minimum degree could be of constant
order. For example, for the two-block SBM, if $q_n, p_{2,n} = O(1/n)$ then the
minimum degree is of constant order. In this case ordinary spectral clustering
using the normalized Laplacian would perform poorly. RSC performs better since,
from (20), it only requires that the larger of the two within-block
probabilities, that is $p_{1,n}$, grow appropriately fast. Figure 1 illustrates
this with $n = 3000$ and edge probability matrix

$$\begin{pmatrix} .01 & .0025 \\ .0025 & .003 \end{pmatrix}. \tag{22}$$


The figure provides the scatter plot of the first two eigenvectors of the
unregularized and regularized sample Laplacians. Plot (a) corresponds to the
usual spectral clustering, while plots (b) and (c) correspond to RSC-r with
r = 26.5 and r = 3000, respectively. Here, r = 26.5 was selected using our
data-driven methodology for selecting r proposed in Section 5, while r = 3000
was selected as suggested by Theorem 5 and Corollary 6. The fractions of
misclassified nodes are 26%, 4% and 6% for cases (a), (b) and (c), respectively.
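As an illustration of this kind of experiment, the following Python sketch samples a graph from an SBM and runs a regularized spectral clustering step. It is only a sketch under stated assumptions: the regularized adjacency matrix is taken to be $A_r = A + (r/n)\mathbf{1}\mathbf{1}'$ (the form consistent with $B_r = B + vv'$ in Appendix A.2), the $K$ eigenvectors with largest eigenvalue magnitude are used, and the k-means step is a plain Lloyd iteration with a deterministic start. The function names `sample_sbm`, `kmeans` and `rsc` are hypothetical and not taken from the paper.

```python
import numpy as np

def sample_sbm(sizes, B, rng):
    """Sample a symmetric 0/1 adjacency matrix (no self-loops) from an SBM."""
    labels = np.repeat(np.arange(len(sizes)), sizes)
    P = np.asarray(B, dtype=float)[np.ix_(labels, labels)]  # n x n edge probabilities
    upper = np.triu(rng.random(P.shape) < P, k=1)           # independent edges, i < j
    A = (upper | upper.T).astype(float)
    return A, labels

def kmeans(X, K, n_iter=50):
    """Plain Lloyd iterations with a deterministic farthest-point start."""
    centers = [X[0]]
    for _ in range(1, K):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[int(np.argmax(d2))])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        for k in range(K):
            if np.any(assign == k):
                centers[k] = X[assign == k].mean(axis=0)
    return assign

def rsc(A, K, r):
    """Regularized spectral clustering, assuming A_r = A + (r/n) 11'."""
    n = A.shape[0]
    A_r = A + r / n
    d = A_r.sum(axis=1)
    if np.any(d <= 0):
        raise ValueError("all regularized degrees must be positive (use r > 0)")
    L_r = A_r / np.sqrt(np.outer(d, d))        # D_r^{-1/2} A_r D_r^{-1/2}
    vals, vecs = np.linalg.eigh(L_r)
    top = np.argsort(-np.abs(vals))[:K]        # K largest eigenvalues in magnitude
    return kmeans(vecs[:, top], K)
```

To mimic the setting of Figure 1 one would call `sample_sbm` with the matrix in (22) and two block sizes summing to 3000 (the text does not state the sizes, so they must be chosen by the reader), and then compare `rsc(A, 2, r)` for r = 0, 26.5 and 3000.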

From the scatter plots one sees that there is considerably less scattering for
the blue points with regularization. This results in improvements in clustering
performance. Also, note that the performance in case (c), in which r is taken
to be very large, is only slightly worse than in case (b). For case (c) there is
almost no variation in the first eigenvector, plotted along the x-axis. This
makes sense since the first eigenvector is proportional to
$(\sqrt{\hat d_{1,r}}, \ldots, \sqrt{\hat d_{n,r}})'$, and for large r one has
$\sqrt{\hat d_{i,r}} \approx \sqrt{r}$.

It may seem surprising that in Corollary 6, claim (20), the smallest
within-block probability, that is $p_{K,n}$, does not matter at all. One way of
explaining this is that if one can do a good job identifying the top $K - 1$
highest degree clusters then the cluster with the lowest degree can also be
identified simply by eliminating nodes not belonging to this cluster.


4  SBM  with  strong  and  weak  clusters 

In many practical situations, not all nodes belong to clusters that can be
estimated well. As mentioned in the introduction, these nodes interfere with the
clustering of the remaining nodes in the sense that none of the top eigenvectors
might discriminate between the nodes that do belong to well-defined clusters.
As an example of a real life data set, we consider the political blogs data set,
which has two clusters, in Subsection 5.2. With ordinary spectral clustering, the
top two eigenvectors do not discriminate between the two clusters (see Figure
2 for an explanation). In fact, it is only the third eigenvector that
discriminates between the two clusters. This results in bad clustering
performance when the first two eigenvectors are considered. However,
regularization rectifies this problem by 'bringing up' the important eigenvector,
thereby allowing for much better performance.

We model the above situation, where there are main clusters as well as
outlier nodes, in the following way: consider a stochastic block model, as in
(14), with $K + K_w$ blocks. In particular, let the block probability matrix be
given by

$$B = \begin{pmatrix} B_s & B_{sw} \\ B_{sw}' & B_w \end{pmatrix}, \tag{23}$$

where $B_s$ is a $K \times K$ matrix with $(p_{1,n}, \ldots, p_{K,n})$ on the diagonal
and $q_n$ on the off-diagonal. Further, $B_{sw}$, $B_w$ are $K \times K_w$ and
$K_w \times K_w$ dimensional matrices respectively. In the above $(K + K_w)$-block
SBM, the top $K$ blocks correspond to the well-defined or strong clusters, while
the bottom $K_w$ blocks correspond to less well-defined or weak clusters.

Figure 2: Depiction of the political blog network [1]. Instead of discriminating
between the red and blue nodes, the second eigenvector discriminates the small
cluster of 4 nodes (circled) from the remaining nodes. This results in bad
clustering performance.

We now formalize our notion of strong and weak clusters. The matrix $B_s$
models the distribution of edges between the nodes belonging to the strong
clusters, while the matrix $B_w$ has the corresponding role for the weak clusters.
The matrix $B_{sw}$ models the interaction between the strong and weak clusters.
For ease of analysis, we make the following simplifying assumptions: assume
that $p_{k,n} = p_n^s$ for $k = 1, \ldots, K$, and that the strong clusters
$C_1, \ldots, C_K$ have equal sizes, that is, assume $n_k = n_s$ for $k = 1, \ldots, K$.

Let $b_{sw}$ be defined as the maximum of the elements in $B_{sw}$, and let $n_w$ be
the number of nodes belonging to a weak cluster. In other words, $K n_s + n_w = n$.
We make the following three assumptions:


$$\frac{(p_n^s - q_n)^2}{p_n^s} \quad \text{grows faster than} \quad \frac{\log n}{n}, \tag{24}$$


$n_w = O(1)$.

6,.  < 

V  n 


(25) 

(26) 


Assumption (24) ensures recovery of the strong clusters if there were no
nodes belonging to weak clusters (see Corollary 6 or McSherry [18], Corollary
1). Assumptions (25) and (26) pertain to the nodes in the weak clusters. In
particular, Assumption (25) simply states that the total number of nodes
belonging to a weak cluster is constant and does not grow with $n$. Assumption
(26) states that the density of the edges between the strong and weak clusters,
denoted by $b_{sw}$, is not too large.

We only assume that the rank of $B_s$ is $K$. Thus, the rank of $B$ is at least $K$.
As before, we assume that $K$ is known and does not grow with $n$. The number
of weak clusters, $K_w$, need not be known and could be as high as $n_w$. We
do not even place any restriction on the sizes of the weak clusters. Indeed, we
even entertain the case that each of the $K_w$ clusters has one node.
Consequently, we are only interested in recovering the strong clusters.

Theorem 7 presents our result for the recovery of the $K$ strong clusters
using the RSC-$r_n$ algorithm, with $\{r_n, n \ge 1\}$ satisfying

$$\frac{n p_n^s \log n}{r_n} = o(1). \tag{27}$$

In other words, the regularization parameter is taken to grow faster than
$n p_n^s \log n$; notice that $n p_n^s$ is of the same order as the expected maximum
degree of the graph. Let $\hat T_1, \ldots, \hat T_K$ be the clusters output by the
RSC-$r_n$ algorithm. Let

$$\hat f = \min_{\pi} \max_{k} \frac{|C_k \cap \hat T_{\pi(k)}^c| + |C_k^c \cap \hat T_{\pi(k)}|}{n_k}$$

be as in (6). Notice that the clusters $C_1, \ldots, C_K$ do not form a partition of
$\{1, \ldots, n\}$, while the estimates $\hat T_1, \ldots, \hat T_K$ do. However, since
$n_w$ does not grow with $n$ this should not make much of a difference.

Theorem 7. Let Assumptions (24), (25) and (26) be satisfied. If
$\{r_n, n \ge 1\}$ satisfies (27) then the clustering error $\hat f$ for RSC-$r_n$
goes to zero with probability tending to one.


The theorem is proved in Appendix C. It states that under Assumptions
(24)-(26) one can get the same results with regularization that one would get
if the nodes belonging to the weak clusters weren't present.

Spectral clustering (with r = 0) may fail under the above assumptions.
This is illustrated in Figure 3. Here n = 2000 and there are two strong clusters
(K = 2) and three weak clusters (M = 3). The first 1600 nodes are evenly split
between the two strong clusters, with the remaining nodes split evenly between
the weak clusters. The matrices $B_s$ and $B_w$ are as in (28), and $B_{sw}$ is a
matrix with all entries .015.

$$B_s = \begin{pmatrix} .025 & .015 \\ .015 & .025 \end{pmatrix}, \qquad
B_w = \begin{pmatrix} .007 & .015 & .015 \\ .015 & .0071 & .015 \\ .015 & .015 & .0069 \end{pmatrix}. \tag{28}$$

The nodes in the weak clusters have relatively lower degrees and, consequently,
cannot be recovered. Figures 3(a) and 3(b) show the first 3 eigenvectors of the
population Laplacian in the unregularized and regularized cases, respectively.
We plot the first 3 instead of the first 5 eigenvectors in order to facilitate
understanding of the plot. In both cases the first eigenvector is not able to
distinguish between the two strong clusters. This makes sense since the first
eigenvector of the Laplacian has elements whose magnitude is proportional to the
square root of the population degrees (see, for example, [26] for a proof of
this fact). Consequently, as the population degrees are the same for the two
strong clusters, the values of this eigenvector are constant for nodes belonging
to the strong clusters.

Figure 3: First three population eigenvectors corresponding to $B_s$ and $B_w$ in
(28). In both plots, the x-axis provides the node indices while the y-axis gives
the eigenvector values. The regularization parameter was taken to be n. The
shaded blue and pink regions correspond to the nodes belonging to the two strong
clusters. The solid red line, solid blue line and -x- black line correspond to
the first, second and third population eigenvectors respectively.
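The fact used above can be checked numerically. The following Python sketch builds the population regularized Laplacian of a small three-block model (two blocks with equal population degrees and one smaller block) and compares its leading eigenvector with the normalized square roots of the population degrees. The block sizes and probabilities are illustrative choices, not values from the paper; the helper name `population_laplacian` is hypothetical, the regularization is assumed to add $r/n$ to every entry, and self-loop probabilities are kept on the diagonal for simplicity.

```python
import numpy as np

def population_laplacian(sizes, B, r=0.0):
    """Population Laplacian D^{-1/2} P_r D^{-1/2} with P_r = Z B Z' + (r/n) 11'."""
    labels = np.repeat(np.arange(len(sizes)), sizes)
    n = labels.size
    P_r = np.asarray(B, dtype=float)[np.ix_(labels, labels)] + r / n
    d = P_r.sum(axis=1)                       # population (regularized) degrees
    return P_r / np.sqrt(np.outer(d, d)), d

# Illustrative model: the first two blocks have equal population degrees.
sizes = [30, 30, 10]
B = [[0.5, 0.2, 0.1], [0.2, 0.5, 0.1], [0.1, 0.1, 0.3]]

for r in (0.0, 70.0):
    L, d = population_laplacian(sizes, B, r)
    vals, vecs = np.linalg.eigh(L)            # eigenvalues in ascending order
    v1 = vecs[:, -1]                          # eigenvector of the largest eigenvalue
    u = np.sqrt(d) / np.linalg.norm(np.sqrt(d))
    print(f"r={r}: top eigenvalue={vals[-1]:.6f}, |<v1, sqrt(d)>|={abs(v1 @ u):.6f}")
```

Since the first two blocks share the same population degree, the leading eigenvector takes a single value on both of them, which is why it cannot separate the two strong clusters.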

The situation is different for the second population eigenvector. In the
regularized case, the second eigenvector is able to distinguish between these two
clusters. However, this is not the case for the unregularized case. From Figure
3(a), not even the third unregularized eigenvector is able to distinguish between
the strong and weak clusters. Indeed, it is only the fifth eigenvector that
distinguishes between the two strong clusters in the unregularized case.

In Figures 4(a) and 4(b) we show the second sample eigenvector for the two
cases in Figures 3(a) and 3(b). Note, we do not show the first sample eigenvector
since, from Figures 3(a) and 3(b), the corresponding population eigenvectors are
not able to distinguish between the two strong clusters. As expected, it is only
for the regularized case that one sees that the second eigenvector is able to do a
good job in separating the two strong clusters. Running k-means, with k = 2,
resulted in a misclassification of 49% of the nodes in the strong clusters in the
unregularized case, compared with 16.25% in the regularized case.

(a) Unregularized    (b) Regularized

Figure 4: Second sample eigenvector corresponding to the situation in Figure 3.
As before, in both plots, the x-axis provides the node indices, while the y-axis
gives the eigenvector values, and the shaded blue and pink regions correspond
to the nodes belonging to the two strong clusters. In plots (a) and (b) the blue
line corresponds to the second eigenvector of the respective sample Laplacian
matrix.


5  DKest  :  Data  dependent  choice  of  r 


The results of Sections 3 and 4 theoretically examined the gains from
regularization for large values of the regularization parameter r. Those results
do not rule out the possibility that intermediate values of r may lead to better
clustering performance. In this section we propose a data dependent scheme to
select the regularization parameter. We compare it with the scheme in [8] that
uses the Girvan-Newman modularity [6]. We use the widely used normalized mutual
information criterion (NMI) [2], [27] to quantify the performance of the spectral
clustering algorithm in terms of closeness of the estimated clusters to the true
clusters.
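For concreteness, a minimal Python sketch of an NMI computation is given below. Several normalizations of the mutual information are in use; this sketch assumes the variant $2 I(a; b) / (H(a) + H(b))$, which may differ from the exact definition used in [2], [27]. The function name `nmi` is hypothetical.

```python
import numpy as np

def nmi(labels_a, labels_b):
    """Normalized mutual information 2 I(a;b) / (H(a) + H(b)) of two labelings."""
    a = np.unique(np.asarray(labels_a).ravel(), return_inverse=True)[1]
    b = np.unique(np.asarray(labels_b).ravel(), return_inverse=True)[1]
    joint = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(joint, (a, b), 1.0)              # contingency table of the two labelings
    joint /= joint.sum()
    pa, pb = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    mi = np.sum(joint[nz] * np.log(joint[nz] / np.outer(pa, pb)[nz]))

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    denom = entropy(pa) + entropy(pb)
    # Both labelings trivial (one cluster each): treat as perfect agreement.
    return 1.0 if denom == 0 else float(2.0 * mi / denom)
```

The value is 1 when the two labelings agree up to a relabeling of the clusters and 0 when they are independent.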

Our scheme works by directly estimating the quantity in (10) in the
following manner: for each $r$ in a grid, an estimate $\hat{\mathcal{L}}_r$ of
$\mathcal{L}_r$ is obtained using the clusters output by the RSC-$r$ algorithm. In
particular, let $\hat C_{1,r}, \ldots, \hat C_{K,r}$ be the estimates of the clusters
$C_1, \ldots, C_K$ produced by running RSC-$r$. The estimate $\hat{\mathcal{L}}_r$ is
taken as the population regularized Laplacian corresponding to an estimated block
probability matrix $\hat B$ and clusters $\hat C_{1,r}, \ldots, \hat C_{K,r}$. More
specifically, the $(k_1, k_2)$-th entry of $\hat B$ is taken as

$$\hat B_{k_1,k_2} = \frac{\sum_{i \in \hat C_{k_1,r},\; j \in \hat C_{k_2,r}} A_{ij}}{|\hat C_{k_1,r}|\,|\hat C_{k_2,r}|}. \tag{29}$$

The above is simply the proportion of edges between the nodes in the cluster
estimates $\hat C_{k_1,r}$ and $\hat C_{k_2,r}$. The following statistic is then
considered:

$$\mathrm{DKest}_r = \frac{\|L_r - \hat{\mathcal{L}}_r\|}{\mu_K(\hat{\mathcal{L}}_r)}, \tag{30}$$

where $\mu_K(\hat{\mathcal{L}}_r)$ denotes the smallest, in magnitude, of the $K$
non-zero eigenvalues of $\hat{\mathcal{L}}_r$, that is, its $K$-th largest eigenvalue
magnitude. The $r$ that minimizes the $\mathrm{DKest}_r$ criterion is then chosen.
Since this criterion provides an estimate of the Davis-Kahan bound, we call it
the DKest criterion.
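The computation of the criterion for a single value of the regularization parameter can be sketched in Python as follows. This is a sketch under stated assumptions rather than the authors' implementation: the cluster labels are supplied by the caller (for example from a run of RSC-r), regularization is assumed to add $r/n$ to every entry of the adjacency matrix, the matrix norm is taken to be the spectral norm, and $\mu_K$ is taken as the $K$-th largest eigenvalue magnitude. The names `regularized_laplacian` and `dkest` are hypothetical.

```python
import numpy as np

def regularized_laplacian(M, r):
    """D^{-1/2} (M + (r/n) 11') D^{-1/2}, with D the row sums of M + (r/n) 11'."""
    n = M.shape[0]
    M_r = M + r / n
    d = M_r.sum(axis=1)
    return M_r / np.sqrt(np.outer(d, d))

def dkest(A, labels, r, K):
    """DKest statistic for adjacency A and integer labels in {0, ..., K-1}.

    Assumes every cluster is non-empty and all regularized degrees are positive.
    """
    labels = np.asarray(labels)
    Z = np.eye(K)[labels]                              # n x K membership matrix
    sizes = Z.sum(axis=0)
    B_hat = (Z.T @ A @ Z) / np.outer(sizes, sizes)     # block edge proportions
    P_hat = Z @ B_hat @ Z.T                            # estimated edge probabilities
    L_r = regularized_laplacian(A, r)                  # sample regularized Laplacian
    L_hat = regularized_laplacian(P_hat, r)            # estimated population Laplacian
    num = np.linalg.norm(L_r - L_hat, ord=2)           # spectral norm (assumption)
    mu = np.sort(np.abs(np.linalg.eigvalsh(L_hat)))[::-1]
    return float(num / mu[K - 1])
```

To select the parameter, one would evaluate `dkest(A, labels_r, r, K)` for each r in the grid, with `labels_r` the clusters obtained at that r, and keep the minimizer.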

We  compare  the  above  to  the  scheme  that  uses  Girvan-Newman  modularity 
[6],  [19],  as  suggested  in  [8].  For  a  particular  r  in  the  grid  the  Girvan-Newman 
modularity  is  computed  for  the  clusters  outputted  using  the  RSC-r  Algorithm. 
The  r  that  maximizes  the  modularity  value  over  the  grid  is  then  chosen. 
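For reference, a minimal Python sketch of the modularity computation is given below. It uses the standard form $Q = \frac{1}{2m}\sum_{i,j}\left(A_{ij} - \frac{d_i d_j}{2m}\right)\mathbf{1}\{c_i = c_j\}$ for an undirected graph with $m$ edges, which we assume is the version intended in [6], [19]. The function name `modularity` is hypothetical.

```python
import numpy as np

def modularity(A, labels):
    """Newman-Girvan modularity of a labeling for a symmetric adjacency matrix A."""
    A = np.asarray(A, dtype=float)
    labels = np.asarray(labels)
    d = A.sum(axis=1)                          # node degrees
    two_m = d.sum()                            # twice the number of edges
    same = labels[:, None] == labels[None, :]  # indicator of same-cluster pairs
    return float(((A - np.outer(d, d) / two_m) * same).sum() / two_m)
```

In the comparison scheme, this quantity would be evaluated on the clusters returned at each r in the grid, and the maximizing r retained.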

Notice that the best possible choice of r would be the one that simply
maximizes the NMI over the selected grid. However, this cannot be computed in
practice since calculation of the NMI requires knowledge of the true clusters.
Nevertheless, this provides a useful benchmark against which one can compare
the other two schemes. We call this the 'oracle' scheme.

5.1  Simulation  Results 

Figure 5 provides results comparing the three schemes, viz. the DKest,
Girvan-Newman and 'oracle' schemes. We perform simulations following the pattern
of [2]. In particular, for a graph with $n$ nodes we take the $K$ clusters to be of
equal sizes. The $K \times K$ block probability matrix is taken to be of the form

$$B = \mathrm{fac} \begin{pmatrix} \beta w_1 & 1 & \cdots & 1 \\ 1 & \beta w_2 & \cdots & 1 \\ \vdots & & \ddots & \vdots \\ 1 & \cdots & 1 & \beta w_K \end{pmatrix}.$$

Here, the vector $w = (w_1, \ldots, w_K)$, whose entries are the inside weights,
denotes the relative degrees of nodes within the communities. Further, the
quantity $\beta$, which is the out-in ratio, represents the ratio of the
probability of an edge between nodes from different communities to the
probability of an edge between nodes in the same community. The scalar parameter
fac is chosen so that the average expected degree of the graph is equal to
$\lambda$.
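The choice of the scaling factor can be sketched as follows. This Python sketch takes the matrix exactly as displayed above and assumes that the average expected degree is computed as $\frac{1}{n}\sum_{i,j} P_{ij}$ with self-loop terms included, so that it equals $n\,\pi' B \pi$ with equal proportions $\pi_k = 1/K$; a convention that excludes self-loops would change fac slightly. The function name `block_matrix` is hypothetical.

```python
import numpy as np

def block_matrix(n, K, beta, w, lam):
    """Scaled K x K block matrix whose average expected degree is lam."""
    w = np.asarray(w, dtype=float)
    B0 = np.ones((K, K))
    np.fill_diagonal(B0, beta * w)           # diagonal beta * w_k, off-diagonal 1
    pi = np.full(K, 1.0 / K)                 # equal cluster proportions
    avg_degree_unscaled = (n * pi) @ B0 @ pi # (1/n) * sum_ij P_ij before scaling
    fac = lam / avg_degree_unscaled
    B = fac * B0
    if B.max() > 1:
        raise ValueError("scaled entries exceed 1; reduce lam or increase n")
    return B
```

For example, `block_matrix(n=120, K=3, beta=0.5, w=[1, 2, 3], lam=20.0)` returns a matrix whose induced $n \times n$ probability matrix has row sums averaging to 20.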

Figure 5 compares the two methods of choosing the best r for various choices
of $n$, $K$, $\beta$, $w$ and $\lambda$. In general, we see that the DKest selection
procedure performs at least as well as, and in some cases much better than, the
procedure that uses the Girvan-Newman modularity. The performance of the two
methods is much closer when the average degree is small.

Figure 5: Performance of spectral clustering as a function of r for the
stochastic block model, for $\lambda$ values of 30, 20 and 10. In the plots we
denote $\beta$ and $w$ as OIR and InWei respectively. The horizontal axis gives the
regularization parameter. The right y-axis provides values for the Girvan-Newman
modularities and DKest functions, while the left y-axis provides values for the
normalized mutual information (NMI). The 3 labeled dots correspond to values
of the NMI at the r values which minimize the DKest, and maximize the
Girvan-Newman modularity and the NMI. Note, the oracle r, or the r that maximizes
the NMI, cannot be calculated in practice.

5.2 Analysis of the Political Blogs dataset

Here we investigate the performance of DKest on the well studied network of
political blogs [1]. The data set aims to study the degree of interaction between
liberal and conservative blogs over a period prior to the 2004 U.S. Presidential
Election. The nodes in the network are select conservative and liberal blog
sites. While the original data set had directed edges corresponding to hyperlinks
between the blog sites, we converted it to an undirected graph by connecting
two nodes with an edge if there is at least one hyperlink from one node to the
other.


The data set has 1222 nodes with an average degree of 27. Spectral clustering
(r = 0) resulted in only 51% of the nodes correctly classified as liberal or
conservative. The oracle procedure, with r = 0.5, resulted in 95% of the nodes
correctly classified. The DKest procedure selected r = 2.25, with an accuracy
of 81%. The Girvan-Newman (GN) procedure, in this case, outperforms the
DKest procedure, providing the same accuracy as the oracle procedure. Figure 6
illustrates these findings. As predicted by our theory, the performance becomes
insensitive for large r. In this case 70% of the nodes are correctly clustered
for large r.

Figure 6: Performance of the three schemes for the political blogs data set [1].

Figure 7: Second eigenvector of the unregularized and regularized Laplacians for
the political blogs data set [1]. The shaded blue and pink regions correspond
to the nodes belonging to the liberal and conservative blogs respectively.

We remark that the DKest procedure does not perform as well as the GN
procedure most likely because our estimate $\hat{\mathcal{L}}_r$ in (30) assumes that
the data is generated from an SBM, which is a poor model for the data due to the
large heterogeneity in the node degrees. A better model for the data would be
the degree corrected stochastic block model (D-SBM) proposed by Karrer and
Newman [14]. If we use D-SBM based estimates in DKest then the selection of
r matches that of the GN and the oracle procedures. See Section 6 for
a discussion on this.

The results of Section 4 also explain why unregularized spectral clustering
performs badly (see Figure 2). The first eigenvector in both cases (regularized
and unregularized) does not discriminate between the two clusters. In Figure 7,
we plot the second eigenvector of the regularized and unregularized Laplacians.
The second eigenvector is able to discriminate between the clusters in the
regularized case, while it fails to do so without regularization. Indeed, it is
only the third eigenvector in the unregularized case that distinguishes between
the clusters, as shown in Figure 8.

Figure 8: Third eigenvector of the unregularized Laplacian.


6  Discussion 

The  paper  provides  a  theoretical  justification  for  regularization.  In  particular, 
we  show  why  choosing  a  large  regularization  parameter  can  lead  to  good  results. 
The  paper  also  partly  explains  empirical  findings  in  Amini  et  al.  [2]  showing  that 
the  performance  of  regularized  spectral  clustering  becomes  insensitive  for  larger 
values  of  regularization  parameters.  It  is  unclear  at  this  stage  whether  the 
benefits  of  regularization,  resulting  from  the  trade-offs  between  the  eigen  gap 
and  the  concentration  bound,  hold  for  the  regularization  in  [7],  [22]  as  they  hold 
for  the  regularization  in  Amini  et  al.  [2]  (as  demonstrated  in  Sections  3  and  4). 

Even though our theoretical results focus on larger values of the
regularization parameter, it is very likely that intermediate values of r produce
better clustering performance. Consequently, we propose a data-driven methodology
for choosing the regularization parameter. We hope to quantify theoretically
the gains from using intermediate values of the regularization parameter in
future work.

For the extension of the SBM proposed in Section 4, if the rank of $B$, given
by (23), is $K$ then the model encompasses specific degree-corrected stochastic
block models (D-SBM) [14] where the edge probability matrix takes the form

$$P = \Theta Z B Z' \Theta.$$

Here $\Theta = \mathrm{diag}(\theta_1, \ldots, \theta_n)$ models the heterogeneity in
the degrees. In particular, consider a $K$-block D-SBM with $0 < \theta_i \le 1$,
for each $i$. Assume that $\theta_i = 1$ for most of the nodes. Take the nodes in
the strong clusters to be those with $\theta_i = 1$. The nodes in the strong
clusters are associated to one of $K$ clusters depending on the cluster they
belong to in the D-SBM. The remaining nodes are taken to be in the weak clusters.
Assumptions (25) and (26) put constraints on the $\theta_i$'s which allow one to
distinguish between the strong clusters via regularization. It would be
interesting to investigate the effect of regularization in more general versions
of the D-SBM, especially where there are high as well as low degree nodes.

The  DKest  methodology  for  choosing  the  regularization  parameter  works 
by  providing  estimates  of  the  population  Laplacian  assuming  that  the  data  is 
drawn  from  an  SBM.  From  our  simulations,  it  is  seen  that  the  performance  of 
DKest  does  not  change  much  if  we  take  the  matrix  norm  in  the  numerator  of 
(30)  to  be  the  Frobenius  norm,  which  is  much  faster  to  compute. 

It is seen that the performance of DKest improves for the political blogs
data set by taking $\hat{\mathcal{L}}_r$ to be the estimate assuming that the data
is drawn from the more flexible D-SBM. Indeed, if we take $\hat{\mathcal{L}}_r$ to be
such an estimate then the performance of DKest is seen to be as good as the
oracle scheme (and the GN scheme) for this data set. We describe how we construct
this estimate in Appendix D.

A  Analysis of SBM with K blocks

Throughout this section we assume that we have samples from a $K$-block SBM.
Denote the sample and population regularized Laplacians as $L_r$ and
$\mathcal{L}_r$ respectively. For ease of notation, we remove the subscript $r$ from
the various matrices such as $L_r$, $\mathcal{L}_r$, $A_r$, $D_r$, $\mathscr{D}_r$. We
also remove the subscript $r$ in the $\hat d_{i,r}$, $d_{i,r}$'s and denote these as
$\hat d_i$, $d_i$ respectively. However, in some situations we may need to refer to
these quantities at $r = 0$. In such cases, we make this clear by writing them as
$\hat d_{i,0}$ for $i = 1, \ldots, n$ and $d_{i,0}$ for $i = 1, \ldots, n$.

We need probabilistic bounds on the weighted sum of Bernoulli random
variables. The following lemma is proved in [13].

Lemma 8. Let $W_j$, $1 \le j \le N$, be $N$ independent Bernoulli($r_j$) random
variables. Furthermore, let $\alpha_j$, $1 \le j \le N$, be non-negative weights that
sum to 1 and let $N_\alpha = 1/\max_j \alpha_j$. Then the weighted sum
$\hat r = \sum_j \alpha_j W_j$, which has mean given by $r^* = \sum_j \alpha_j r_j$,
satisfies the following large deviation inequalities. For any $r$ with
$0 < r < r^*$,

$$P(\hat r < r) \le \exp\{-N_\alpha D(r \| r^*)\} \tag{31}$$

and for any $\tilde r$ with $r^* < \tilde r < 1$,

$$P(\hat r > \tilde r) \le \exp\{-N_\alpha D(\tilde r \| r^*)\} \tag{32}$$

where $D(r \| r^*)$ denotes the relative entropy between Bernoulli random variables
of success parameters $r$ and $r^*$. (In this lemma and in Corollary 9, $r$,
$\hat r$, $\tilde r$ and $r^*$ denote Bernoulli parameters, not the regularization
parameter.)

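The following is an immediate corollary of the above.

Corollary 9. Let $W_j$ be as in Lemma 8. Let $\beta_j$, for $j = 1, \ldots, N$, be
non-negative weights, and let

$$W = \sum_{j=1}^{N} \beta_j W_j.$$

Then,

$$P(W - E(W) > \delta) \le \exp\left\{-\frac{\delta^2}{2 \max_j \beta_j\,(E(W) + \delta)}\right\} \tag{33}$$

and

$$P(W - E(W) < -\delta) \le \exp\left\{-\frac{\delta^2}{2 \max_j \beta_j\,E(W)}\right\}. \tag{34}$$

Proof. Here we use the fact that

$$D(r \| r^*) \ge \frac{(r - r^*)^2}{2 \max\{r, r^*\}}, \tag{35}$$

for any $0 < r, r^* < 1$. We prove (33). The proof of (34) is similar. The event
under consideration may be written as

$$\{\hat r - r^* > \tilde\delta\},$$

where $\hat r = W/\sum_j \beta_j$, $r^* = E(W)/\sum_j \beta_j$ and
$\tilde\delta = \delta/\sum_j \beta_j$. Correspondingly, using Lemma 8 with weights
$\alpha_j = \beta_j/\sum_j \beta_j$, and (35), one gets that

$$P(W - E(W) > \delta) \le \exp\left\{-\frac{\sum_j \beta_j}{\max_j \beta_j}\,\frac{\tilde\delta^2}{2(r^* + \tilde\delta)}\right\}.$$

Substituting the values of $\tilde\delta$ and $r^*$ results in bound (33). □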
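As a sanity check on the upper tail bound of the corollary, the following Python sketch computes the exact tail probability of a small weighted Bernoulli sum by enumerating all outcomes and compares it with the bound $\exp\{-\delta^2 / (2\max_j\beta_j (E(W)+\delta))\}$. The weights and success probabilities are arbitrary illustrative values, and the function names `exact_upper_tail` and `bound_33` are hypothetical.

```python
import itertools
import math

def exact_upper_tail(betas, probs, delta):
    """Exact P(W - E(W) > delta) for W = sum_j beta_j W_j, by enumeration."""
    mean = sum(b * p for b, p in zip(betas, probs))
    total = 0.0
    for outcome in itertools.product((0, 1), repeat=len(betas)):
        w = sum(b * x for b, x in zip(betas, outcome))
        if w - mean > delta:
            total += math.prod(p if x else 1 - p for p, x in zip(probs, outcome))
    return total

def bound_33(betas, probs, delta):
    """Upper tail bound exp{-delta^2 / (2 max_j beta_j (E(W) + delta))}."""
    mean = sum(b * p for b, p in zip(betas, probs))
    return math.exp(-delta ** 2 / (2 * max(betas) * (mean + delta)))

betas = [1.0, 0.5, 2.0, 1.0, 0.3, 1.5]
probs = [0.2, 0.5, 0.1, 0.7, 0.4, 0.3]
for delta in (0.25, 0.5, 1.0, 2.0):
    print(delta, exact_upper_tail(betas, probs, delta), bound_33(betas, probs, delta))
```

The enumeration is exponential in the number of variables, so it is only meant for small illustrative cases such as this one.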

The following lemma provides high probability bounds on the degree. Let
$\tau_{\min} = \max\{d_{\min,n}, c \log n\}$ and $\delta_{i,c} = \max\{d_{i,0}, c \log n\}$.

Lemma 10. On a set $E_1$ of probability at least $1 - 2/n^{c_1 - 1}$, one has

$$|\hat d_{i,r} - d_{i,r}| \le c_2 \sqrt{\delta_{i,c} \log n} \quad \text{for each } i = 1, \ldots, n,$$

where $c_1 = .5 c_2^2/(1 + c_2/\sqrt{c})$.

Proof. Use the fact that $\hat d_{i,r} - d_{i,r} = \hat d_{i,0} - d_{i,0}$, and the
union bound

$$P\left(|\hat d_{i,0} - d_{i,0}| > c_2\sqrt{\delta_{i,c}\log n} \text{ for some } i\right) \le \sum_{i=1}^{n} P\left(|\hat d_{i,0} - d_{i,0}| > c_2\sqrt{\delta_{i,c}\log n}\right).$$

Notice that $\hat d_{i,0} = \sum_j A_{ij}$. Apply Corollary 9 with $\beta_j = 1$ and
$W_j = A_{ij}$, and $\delta = c_2\sqrt{\delta_{i,c} \log n}$, to bound each term in the
sum on the right side of the above equation.

Each term can be bounded by

$$2 \exp\left\{-\frac{\delta^2}{2(E(W) + \delta)}\right\}. \tag{36}$$

We claim that

$$E(W) + \delta \le (1 + c_2/\sqrt{c})\,\delta_{i,c}. \tag{37}$$

Substituting the above bound in the error exponent (36) will complete the proof.

To see the claim, notice that $E(W) = d_{i,0}$. Now, consider the case
$d_{i,0} \ge c \log n$. In this case, $\delta_{i,c} = d_{i,0}$ and
$\log n \le d_{i,0}/c$. Correspondingly, $E(W) + \delta$ is at most
$d_{i,0}(1 + c_2/\sqrt{c})$.

Next, consider the case $d_{i,0} < c \log n$. In this case
$\delta_{i,c} = \tau_{\min}$, which is $c \log n$. Consequently,

$$E(W) + \delta \le c \log n + c_2 \sqrt{c} \log n.$$

The right side of the above can be bounded by $(1 + c_2/\sqrt{c})(c \log n)$. This
proves the claim. □

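A.1  Concentration of Laplacian

Below we provide the proof of Theorem 4. Throughout this section we assume
that the quantities $c$, $c_2$ appearing in Lemma 10 are given by $c = 32$ and
$c_2 = 2\sqrt{2}$. Notice that this makes $c_1 > 2$, where $c_1$ is as in Lemma 10.

From Lemma 10, with probability at least $1 - n^{-1}$,

$$\max_i |\hat d_i - d_i|/d_i \le \max_i c_2 \sqrt{\delta_{i,c} \log n}/d_i.$$

We claim that the right side of the above is at most 1/2. To see this notice that

$$\frac{\sqrt{\delta_{i,c} \log n}}{d_i} \le \frac{\sqrt{\delta_{i,c} \log n}}{\delta_{i,c}} = \sqrt{\frac{\log n}{\delta_{i,c}}} \le \frac{1}{\sqrt{c}}.$$

Here the first inequality follows from noting that $d_i = d_{i,0} + r$, which is at
least $\max\{d_{i,0}, c \log n\}$, using $r \ge c \log n$. The last inequality follows
from using $\delta_{i,c} \ge c \log n$. Consequently,
$\max_i |\hat d_i - d_i|/d_i \le 1/2$ using $c_2 = 2\sqrt{2}$ and $c = 32$.

Proof of Theorem 4. Our proof has parallels with the proof in [21]. Write
$\tilde L = \mathscr{D}^{-1/2} A \mathscr{D}^{-1/2}$, where $\mathscr{D}$ is the diagonal
matrix of population degrees. Then,

$$\|L - \mathcal{L}\| \le \|L - \tilde L\| + \|\tilde L - \mathcal{L}\|.$$

We first bound $\|L - \tilde L\|$. Let $F = D^{1/2} \mathscr{D}^{-1/2}$. Then
$\tilde L = F L F$. Correspondingly,

$$\begin{aligned}
\|L - \tilde L\| &\le \|L - FL\| + \|FL - \tilde L\| \\
&\le \|I - F\|\,\|L\| + \|F\|\,\|L\|\,\|I - F\| \\
&\le \|I - F\|\,(2 + \|I - F\|),
\end{aligned} \tag{38}$$

where the last step uses $\|L\| \le 1$ and $\|F\| \le 1 + \|I - F\|$. Notice that

$$F - I = (I + (D - \mathscr{D})\mathscr{D}^{-1})^{1/2} - I.$$

Further, using $\max_i |\hat d_i - d_i|/d_i \le 1/2$, and the fact that
$|\sqrt{1 + x} - 1| \le |x|$ for $x \in [-3/4, 3/4]$, as in [21], one gets that

$$\|F - I\| \le \max_i \frac{|\hat d_i - d_i|}{d_i} \le c_2 \max_i \frac{\sqrt{\delta_{i,c} \log n}}{d_i}$$

with high probability. Consequently, using (38), one gets that

$$\|L - \tilde L\| \le c_2 \max_i \frac{\sqrt{\delta_{i,c} \log n}}{d_i} \left(2 + c_2 \max_i \frac{\sqrt{\delta_{i,c} \log n}}{d_i}\right)$$

with probability at least $1 - 1/n^{c_1 - 1}$. We claim that

$$\max_i \frac{\sqrt{\delta_{i,c}}}{d_i} \le \epsilon_{r,n} =
\begin{cases}
\dfrac{1}{\sqrt{d_{\min,n} + r}}, & \text{if } r \le 2 d_{\max,n}, \\[2ex]
\dfrac{\sqrt{d_{\max,n}}}{d_{\max,n} + r/2}, & \text{if } r > 2 d_{\max,n}.
\end{cases} \tag{39}$$

To see this, notice that $\delta_{i,c} \le d_{i,0} + r = d_i$, using
$\max\{r, d_{i,0}\} \ge c \log n$. Consequently,
$\sqrt{\delta_{i,c}}/d_i \le 1/\sqrt{d_{i,0} + r}$, which is at most
$1/\sqrt{d_{\min,n} + r}$. Further,

$$\max_i \frac{\sqrt{\delta_{i,c}}}{d_i} \le \frac{\sqrt{d_{\max,n}}}{d_{\max,n} + r}$$

for $r > d_{\max,n}$. This is at most $\epsilon_{r,n}$ for $r > 2 d_{\max,n}$.
Consequently, from (39), and using $c_2 \max_i \sqrt{\delta_{i,c} \log n}/d_i \le c_2/\sqrt{c}$,
one gets that

$$\|L - \tilde L\| \le c_2\,\epsilon_{r,n} \sqrt{\log n}\,(2 + c_2/\sqrt{c}) \tag{40}$$

with probability at least $1 - 1/n^{c_1 - 1}$.

Next, we bound $\|\tilde L - \mathcal{L}\|$. We get high probability bounds on this
quantity using results in [21], [17]. In particular, as in [21],

$$\tilde L - \mathcal{L} = \sum_{i \le j} Y_{ij},$$

where $Y_{ij} = \mathscr{D}^{-1/2} X_{ij} \mathscr{D}^{-1/2}$, with

$$X_{ij} = \begin{cases}
(A_{ij} - P_{ij})(e_i e_j^T + e_j e_i^T), & \text{if } i \ne j, \\
(A_{ij} - P_{ij})\,e_i e_i^T, & \text{if } i = j.
\end{cases}$$

Further, $\|Y_{ij}\| \le 1/(d_{\min,n} + r)$. Let
$\sigma^2 = \|\sum_{i \le j} E(Y_{ij}^2)\|$. We claim that
$\sigma^2 \le \epsilon_{r,n}^2$. As in [21], page 15, notice that

$$\sum_{i \le j} E(Y_{ij}^2) = \sum_{i=1}^{n} \frac{1}{d_{i,0} + r} \left(\sum_{j=1}^{n} \frac{P_{ij}(1 - P_{ij})}{d_{j,0} + r}\right) e_i e_i^T.$$

Clearly,

$$\sum_{j=1}^{n} \frac{P_{ij}(1 - P_{ij})}{d_{j,0} + r} \le \frac{d_{i,0}}{d_{\min,n} + r}. \tag{41}$$

Consequently, using $d_{i,0}/(d_{i,0} + r) \le 1$, for each $i$ the coefficient of
$e_i e_i^T$ is at most $1/(d_{\min,n} + r)$, leading to the fact that
$\sigma^2 \le 1/(d_{\min,n} + r)$.

For $r > 2 d_{\max,n}$ we can get improvements in the bound for $\sigma^2$. By using
the fact that $d_{j,0} + r \ge d_{\max,n} + r/2$ for $r > 2 d_{\max,n}$, one gets that

$$\sum_{j=1}^{n} \frac{P_{ij}(1 - P_{ij})}{d_{j,0} + r} \le \frac{d_{i,0}}{d_{\max,n} + r/2}$$

for $r > 2 d_{\max,n}$. Consequently, using
$d_{i,0}/(d_{i,0} + r) \le d_{\max,n}/(d_{\max,n} + r)$, one gets that
$\sigma^2 \le d_{\max,n}/(d_{\max,n} + r/2)^2$ for $r > 2 d_{\max,n}$.

Applying Corollary 4.2 in [17] one gets

$$P(\|\tilde L - \mathcal{L}\| \ge t) \le n e^{-t^2/2\sigma^2}.$$

Thus, taking $t = \sigma\sqrt{2 c_1 \log n}$ and using $\sigma \le \epsilon_{r,n}$, with
probability at least $1 - 1/n^{c_1 - 1}$ one has

$$\|\tilde L - \mathcal{L}\| \le \sqrt{2 c_1 \log n}\;\epsilon_{r,n}. \tag{42}$$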
As  a  result,  combining  (40)  and  (42),  one  gets  that  with  probability  at  least 
1  —  2/nCl_1,  one  has 

|| Lt  -  S£t ||  <  yjlogn  eTtn  [V2ci  +  c2  ( 2  +  (c2/\/c))] 

Substituting  the  values  of  c2,  c,  and  noting  that  C\  >  2  one  gets  the  expression 
in  the  theorem.  □ 
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A. 2  Proof  of  Lemma  1 


Notice  that  the  population  regularized  Laplacian  SZT  corresponds  to  the  popu¬ 
lation  Laplacian  of  an  ordinary  stochastic  block  model  with  block  probability 
matrix 

BT  =  B  +  vv', 

where  v  =  (•*/ r/n)l.  Correspondingly,  we  can  use  the  following  facts  of  the 
population  eigenvectors  and  eigenvalues  given  for  a  SBM. 

Let $Z$ be the community membership matrix, that is, the $n \times K$ matrix with entry $(i,k)$ being 1 if node $i$ belongs to cluster $C_k$, and 0 otherwise. The following is proved in [23]:

1. Let $R = \mathscr{D}_\tau^{-1}$. Then, the non-zero eigenvalues of $\mathscr{L}_\tau$ are the same as those of

$$B_{eig} = B_\tau(Z'RZ), \tag{43}$$

or equivalently, $\tilde B_{eig} = (Z'RZ)^{1/2}B_\tau(Z'RZ)^{1/2}$.

2. Define $\mu = R^{1/2}Z(Z'RZ)^{-1/2}$. Let,

$$\tilde B_{eig} = H\Lambda H^T,$$

where the right side of the above gives the singular value decomposition of the matrix on the left. Then the eigenvectors of $\mathscr{L}_\tau$ are given by $\mu H$.


Further, since in the stochastic block model the expected node degrees are the same for all nodes in a particular cluster, one can write $R^{1/2}Z = ZQ$, where $Q^{-2}$ is the $K \times K$ diagonal matrix of population degrees of nodes in a particular community. Consequently, one sees that

$$\mu H = Z(Z^TZ)^{-1/2}H.$$

Lemma 1 follows from noting that

$$\mu H(\mu H)^T = Z(Z^TZ)^{-1}Z^T$$

and the fact that $(Z^TZ)^{-1} = \mathrm{diag}(1/n_1, \ldots, 1/n_K)$.
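The two facts used in this proof can be checked numerically on a toy model. The sketch below is only an illustration under assumed values (three blocks, the matrix `B`, and `tau = 3.0` are arbitrary choices, not taken from the paper): it builds the population regularized Laplacian, compares its non-zero eigenvalues with those of the $K \times K$ symmetric matrix $(Z'RZ)^{1/2}B_\tau(Z'RZ)^{1/2}$, and compares the projection onto its leading $K$ eigenvectors with $Z(Z^TZ)^{-1}Z^T$.

```python
import numpy as np

sizes = [4, 6, 10]                       # assumed block sizes n_1, n_2, n_3
B = np.array([[0.5, 0.1, 0.1],
              [0.1, 0.4, 0.1],
              [0.1, 0.1, 0.3]])          # assumed block probability matrix
tau = 3.0                                # assumed regularization parameter
K = len(sizes)

labels = np.repeat(np.arange(K), sizes)
Z = np.eye(K)[labels]                    # n x K membership matrix
n = Z.shape[0]

P_tau = Z @ B @ Z.T + tau / n            # P + (tau/n) 11'
d_tau = P_tau.sum(axis=1)                # d_i + tau
L_pop = P_tau / np.sqrt(np.outer(d_tau, d_tau))

eigvals, eigvecs = np.linalg.eigh(L_pop)
top = np.argsort(-np.abs(eigvals))[:K]   # K eigenvalues of largest magnitude
V = eigvecs[:, top]

ZRZ = Z.T @ (Z / d_tau[:, None])         # Z' R Z, diagonal with entries n_k / (d_k + tau)
M_half = np.sqrt(ZRZ)                    # elementwise sqrt is valid for a diagonal matrix
B_sym = M_half @ (B + tau / n) @ M_half
small_eigs = np.linalg.eigvalsh(B_sym)

proj_eig = V @ V.T
proj_Z = Z @ np.linalg.inv(Z.T @ Z) @ Z.T

print(np.allclose(np.sort(eigvals[top]), np.sort(small_eigs)))
print(np.allclose(proj_eig, proj_Z))
```

Comparing projections rather than eigenvectors avoids sign and rotation ambiguities when eigenvalues repeat.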


B  Proof  of  Theorem  5 


We first prove (16). Recall that $\delta_n$ is the limit of $\epsilon_{\tau,n}/\mu_{K,\tau}$, as $\tau \to \infty$. Now $\tau\epsilon_{\tau,n}$ converges to $2\sqrt{d_{\max,n}\log n}$. Consequently, we now show that

$$\lim_{\tau\to\infty}\frac{1}{\tau\mu_{K,\tau}} \asymp \frac{\tilde m_{1,n}m_{1,n} - m_{2,n}}{m_{1,n}}. \tag{44}$$

Recall that $\mu_{K,\tau}$ is the $K$-th largest (that is, the smallest) eigenvalue of $B_{eig}$ (43). Now,

$$\frac{1}{\mu_{K,\tau}} \asymp \mathrm{trace}(B_{eig}^{-1}).$$

The above follows from noting that $1/\mu_{K,\tau}$ is equal to the largest eigenvalue of $B_{eig}^{-1}$, and the fact that the latter is $\asymp \mathrm{trace}(B_{eig}^{-1})$, as $K$ is fixed. We now proceed to show that $\mathrm{trace}(B_{eig}^{-1})/\tau$ converges to a quantity that is of the same order of magnitude as the right side of (44). This will prove (16).

Recall  that  the  block  probability  matrix  B  is  given  by  (14).  We  first  consider 
the  case  that  qn  =  0,  that  is,  there  is  no  interaction  between  the  clusters.  Notice, 


$$B_{eig}^{-1} = F^{-1}(B + vv')^{-1},$$

where

$$F^{-1} = \mathrm{diag}\left(\frac{\gamma_1+\tau}{n_1}, \ldots, \frac{\gamma_K+\tau}{n_K}\right).$$


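Here, for convenience, we remove the subscript $n$ from quantities such as $\gamma_{j,n}$. Using the Sherman-Morrison formula,

$$(B + vv')^{-1} = B^{-1} - \frac{(B^{-1}v)(B^{-1}v)'}{1 + v'B^{-1}v}.$$

One sees that $B^{-1}v = \sqrt{\tau/n}\,(1/p_1, \ldots, 1/p_K)'$. Correspondingly,

$$v'B^{-1}v = \frac{\tau}{n}\sum_{i} 1/p_i = \tau m_{1,n},$$

using $q_n = 0$. Further, the diagonal entries of the matrix $(B^{-1}v)(B^{-1}v)'$ can be written as

$$\frac{\tau}{n}\,\mathrm{diag}(1/p_1^2, \ldots, 1/p_K^2).$$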

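We need the trace of $B_{eig}^{-1}$. Using the above, one sees that

$$\mathrm{trace}(B_{eig}^{-1}) = \sum_{k}\frac{\gamma_k+\tau}{\gamma_k} - \frac{\tau m_{1,n} + \tau^2 m_{2,n}}{1 + \tau m_{1,n}}.$$

Since $K$ is fixed, we have,

$$\mathrm{trace}(B_{eig}^{-1}) \asymp \tau\tilde m_{1,n} - \frac{\tau m_{1,n} + \tau^2 m_{2,n}}{1 + \tau m_{1,n}}.$$

Thus, as $\tau \to \infty$, one gets that,

$$\frac{\mathrm{trace}(B_{eig}^{-1})}{\tau} \quad \text{converges to} \quad \tilde m_{1,n} - m_{2,n}/m_{1,n}. \tag{45}$$

The right side of the above is positive, as $\tilde m_{1,n}m_{1,n} > m_{2,n}$, for $K > 1$.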
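The closed form for the trace and its limit (45) can be checked numerically. The sketch below assumes the definitions that the derivation implies for the diagonal case $q_n = 0$: $w_k = n_k/n$, $\gamma_k = n_k p_k$, $m_{1,n} = \sum_k w_k/\gamma_k$, $\tilde m_{1,n} = \sum_k 1/\gamma_k$ and $m_{2,n} = \sum_k w_k/\gamma_k^2$. The function names and the numerical values are illustrative assumptions.

```python
import numpy as np

def moments(p, sizes):
    """Return gamma_k, m_1, m~_1, m_2 for a diagonal block model (q = 0)."""
    sizes = np.asarray(sizes, dtype=float)
    p = np.asarray(p, dtype=float)
    w = sizes / sizes.sum()
    gamma = sizes * p
    return gamma, np.sum(w / gamma), np.sum(1.0 / gamma), np.sum(w / gamma**2)

def trace_inv_numeric(p, sizes, tau):
    """trace(B_eig^{-1}) with B_eig = B_tau (Z'RZ), computed by direct inversion."""
    sizes = np.asarray(sizes, dtype=float)
    gamma, _, _, _ = moments(p, sizes)
    B_tau = np.diag(np.asarray(p, dtype=float)) + tau / sizes.sum()
    ZRZ = np.diag(sizes / (gamma + tau))
    return np.trace(np.linalg.inv(B_tau @ ZRZ))

def trace_inv_closed(p, sizes, tau):
    """Closed form obtained from the Sherman-Morrison step."""
    gamma, m1, _, m2 = moments(p, sizes)
    return np.sum((gamma + tau) / gamma) - (tau * m1 + tau**2 * m2) / (1.0 + tau * m1)

p, sizes = [0.5, 0.3, 0.2], [20, 30, 50]
_, m1, m1_tilde, m2 = moments(p, sizes)
limit = m1_tilde - m2 / m1

numeric = trace_inv_numeric(p, sizes, tau=7.0)
closed = trace_inv_closed(p, sizes, tau=7.0)
ratio_large_tau = trace_inv_closed(p, sizes, tau=1e9) / 1e9
print(numeric, closed, ratio_large_tau, limit)
```

The large-$\tau$ ratio is evaluated with the closed form rather than by matrix inversion, since the inversion becomes poorly conditioned when $\tau$ is very large.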

Now consider the $K$ block model with off-diagonal elements of $B$ equal to $q$. Notice that

$$B_\tau = B_0 + \tilde v\tilde v^T,$$

where $B_0 = \mathrm{diag}(p_1 - q, \ldots, p_K - q)$ and $\tilde v = \sqrt{\tilde\tau/n}\,\mathbf{1}$, where $\tilde\tau = \tau + nq$. Thus applying the above result for the diagonal block model one gets that if $\tau$ tends to infinity, the quantity $\mathrm{trace}(B_{eig}^{-1})/\tilde\tau$ converges to $\tilde m_{1,n} - m_{2,n}/m_{1,n}$, where here $\gamma_k = n_k(p_k - q)$. This proves (16).

We now prove that RSC-$\tau_n$ provides consistent cluster estimates for $\{\tau_n, n \ge 1\}$ satisfying (15). We need to show that $\epsilon_{\tau_n,n}/\mu_{K,\tau_n}$ goes to zero.

First, notice that $\tau_n\epsilon_{\tau_n,n} \lesssim \sqrt{d_{\max,n}\log n}$. Consequently, from the above, we need to show that $\mathrm{trace}(B_{eig}^{-1})\sqrt{d_{\max,n}\log n}/\tau_n$ is $o(1)$ if $\delta_n = o(1)$. From (45) one has

$$\frac{\mathrm{trace}(B_{eig}^{-1})\sqrt{d_{\max,n}\log n}}{\tau_n} \asymp \sqrt{d_{\max,n}\log n}\left[\frac{\tilde m_{1,n} - m_{1,n}}{1 + \tau_n m_{1,n}} + (\tau_n m_{1,n})\,\frac{\tilde m_{1,n} - m_{2,n}/m_{1,n}}{1 + \tau_n m_{1,n}}\right].$$

The second term is bounded by $\delta_n$, which, by assumption, goes to zero. The first term is bounded by $\sqrt{d_{\max,n}\log n}\;\tilde m_{1,n}/(m_{1,n}\tau_n)$. Noting that $\tilde m_{1,n}/m_{1,n} \le \sum_k 1/w_k$, one gets that the first term also goes to 0, as $\tau_n$ satisfies (15).


B.1 Proof of Corollary 6

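For the $K$-block SBM, let $\tau_K = \gamma_{K,n}/\gamma_{K-1,n}$. Notice that $\tau_K \asymp (p_{K,n} - q_n)/(p_{K-1,n} - q_n)$ using $w_k \asymp 1$. Use the fact that $m_{1,n} = (1/\gamma_{K,n})(w_K + O(\tau_K))$, $\tilde m_{1,n} = (1/\gamma_{K,n})(1 + O(\tau_K))$ and $m_{2,n} = (1/\gamma_{K,n}^2)(w_K + O(\tau_K))$, to get that

$$\frac{\tilde m_{1,n}m_{1,n} - m_{2,n}}{m_{1,n}} = O(1/\gamma_{K-1,n}).$$

Consequently, $\delta_n = O(\sqrt{d_{\max,n}\log n}/\gamma_{K-1,n})$. The proof of claim (20) is completed by noting that $\gamma_{K-1,n} \asymp n(p_{K-1,n} - q_n)$ and $d_{\max,n} \asymp np_{1,n}$.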

For the 2-block SBM we show that

$$\delta_n \asymp \frac{\sqrt{d_{\max,n}\log n}}{n\,w_1w_2\left[(p_{1,n} + p_{2,n})/2 - q_n\right]}. \tag{46}$$

Expression (46) follows from using (16) and noting that

$$\frac{\tilde m_{1,n}m_{1,n} - m_{2,n}}{m_{1,n}} = \frac{1}{w_2\gamma_{1,n} + w_1\gamma_{2,n}}$$

for the two-block model. It is seen that

$$w_2\gamma_{1,n} + w_1\gamma_{2,n} = 2n\,w_1w_2\left[(p_{1,n} + p_{2,n})/2 - q_n\right].$$

Notice that $w_1w_2 \asymp \min\{w_1, w_2\}$. Consequently, (21) follows from noting that when $p_{2,n} = q_n$ then $d_{\max,n} = n(w_1p_{1,n} + w_2q_n)$.
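The two algebraic identities used for the two-block model are easy to confirm numerically. The following sketch uses arbitrary illustrative values for $n$, $w_1$, $p_1$, $p_2$ and $q$, and the function name `two_block_check` is hypothetical. The moments are assumed to be $m_{1,n} = \sum_k w_k/\gamma_k$, $\tilde m_{1,n} = \sum_k 1/\gamma_k$ and $m_{2,n} = \sum_k w_k/\gamma_k^2$ with $\gamma_k = n w_k(p_k - q)$.

```python
import numpy as np

def two_block_check(n, w1, p1, p2, q):
    """Evaluate both sides of the two-block identities for given parameters."""
    w = np.array([w1, 1.0 - w1])
    gamma = n * w * (np.array([p1, p2]) - q)          # gamma_k = n_k (p_k - q)
    m1 = np.sum(w / gamma)
    m1_tilde = np.sum(1.0 / gamma)
    m2 = np.sum(w / gamma**2)
    lhs = (m1_tilde * m1 - m2) / m1                   # quantity appearing in (16)
    denom = w[1] * gamma[0] + w[0] * gamma[1]         # w_2 gamma_1 + w_1 gamma_2
    closed = 2.0 * n * w[0] * w[1] * ((p1 + p2) / 2.0 - q)
    return lhs, denom, closed

lhs, denom, closed = two_block_check(n=1000, w1=0.3, p1=0.08, p2=0.05, q=0.01)
print(lhs, 1.0 / denom, closed)
```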


C  Proof  of  Results  in  Section  4 

In this section we provide the proof of Theorem 7, along with Lemmas 11 and 12 required in proving the theorem.

C.1 Proof of Theorem 7

Denote $C_w$ as the set of nodes belonging to the weak clusters. We club all the nodes belonging to the weak clusters into the cluster $C_K$ and call this combined cluster $\tilde C_K$, that is $\tilde C_K = C_K \cup C_w$. For consistency of notation, let $\tilde C_k = C_k$, for $1 \le k \le K-1$, and let $\tilde n_k = |\tilde C_k|$, for $k = 1, \ldots, K$.

Denote

$$\tilde f = \min_{\pi}\max_{k}\frac{|\tilde C_k \cap \hat C^c_{\pi(k)}| + |\tilde C^c_k \cap \hat C_{\pi(k)}|}{\tilde n_k}.$$

It  is  not  hard  to  see  that, 


n 


/  ^  1  +  /  +  tt- 


na 


Consequently, showing that $\tilde f$ goes to zero, along with the fact that $n_w = O(1)$, will show that $\hat f$ goes to zero.

We now show that $\tilde f$ goes to zero with high probability. For a given assignment of nodes in one of the $K + K_w$ clusters we denote by $L_\tau$, $\mathscr{L}_\tau$ the sample and population regularized Laplacians respectively. Further, let $\tilde{\mathscr{L}}_\tau$ be the population regularized Laplacian of a $(K+1)$-block SBM constructed from clusters $C_1, \ldots, C_K$ and $C_w$, and block probability matrix

$$\tilde B = \begin{pmatrix} B_s & b_{sw}\mathbf{1} \\ b_{sw}\mathbf{1}' & 1 \end{pmatrix},$$

where the $K \times K$ matrix $B_s$ is as in Section 4.

Since $\tilde B$ has rank $K+1$, the same holds also for $\tilde{\mathscr{L}}_\tau$. We denote by $\tilde\mu_{k,\tau}$, for $k = 1, \ldots, n$, the magnitudes of the eigenvalues of $\tilde{\mathscr{L}}_\tau$ arranged in decreasing order. Notice that $\tilde\mu_{k,\tau} = 0$ for $k > K+1$. Further, let $\tilde{\mathscr{V}}_\tau$ be the $n \times K$ eigenvector matrix of $\tilde{\mathscr{L}}_\tau$.

Lemma 11 shows that $\tilde\mu_{2,\tau} = \ldots = \tilde\mu_{K,\tau}$, and provides explicit expressions for these eigenvalues. Further, the lemma also characterizes the norm of the difference of the rows of $\tilde{\mathscr{V}}_\tau$. In the lemma below we denote $d^s_n = n_sp^s_n + (K-1)n_sq_n + n_wb_{sw}$ and $d^w_n = n_w + (n - n_w)b_{sw}$. The quantities $d^s_n$ and $d^w_n$ provide the expected degrees of the nodes for an SBM drawn according to $\tilde B$.


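Lemma 11. The following holds:

1. The eigenvalue $\tilde\mu_{1,\tau} = 1$. Further, let $\gamma_n = n_s(p^s_n - q_n)$. Then

$$\tilde\mu_{k,\tau} = \frac{\gamma_n}{d^s_n + \tau} \quad \text{for } k = 2, \ldots, K, \tag{47}$$

$$\tilde\mu_{K+1,\tau} = \frac{n_w(1 + \tau/n)}{d^w_n + \tau} - \frac{n_w(b_{sw} + \tau/n)}{d^s_n + \tau}. \tag{48}$$

2. The matrix $\tilde{\mathscr{V}}_\tau$ has $K+1$ distinct rows corresponding to the $K+1$ clusters $C_1, \ldots, C_K$ and $C_w$. Denote these as $\mathrm{cent}_{1,\tau}, \ldots, \mathrm{cent}_{K,\tau}$ and $\mathrm{cent}^w_\tau$.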
Then  1  <  k'  ^  k  <  K 


for  1  <  k  <  K , 


|| centk^T  -  centk',T\\ 


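The above lemma is proved in Appendix C.3. Let $\mathscr{V}_\tau$ be an $n \times K$ matrix, with $i$-th row equal to $\mathrm{cent}_{k,\tau}$ for $i \in \tilde C_k$.

Now $\mathscr{V}_\tau$ has $K$ distinct rows corresponding to the $K$ clusters $\tilde C_1, \ldots, \tilde C_K$. We denote these distinct rows as the population cluster centers. From Lemma 2, if

$$\|\mathrm{cent}_{k,\tau} - \mathrm{cent}_{k',\tau}\| \ge (1/\delta)\,\|V_\tau - \mathscr{V}_\tau\|/\sqrt{n_s},$$

then $\tilde f = O(\delta^2)$. Since $\|\mathrm{cent}_{k,\tau} - \mathrm{cent}_{k',\tau}\| \gtrsim 1/\sqrt{n_s}$ from Lemma 11, one gets that one needs to show that $\|V_\tau - \mathscr{V}_\tau\| \lesssim \delta$, with high probability, for some $\delta$ that goes to zero for large $n$.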

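Now,

$$\|V_\tau - \mathscr{V}_\tau\| \le \|V_\tau - \tilde{\mathscr{V}}_\tau\| + \|\tilde{\mathscr{V}}_\tau - \mathscr{V}_\tau\| = \|V_\tau - \tilde{\mathscr{V}}_\tau\| + \sqrt{n_w}\,\|\mathrm{cent}^w_\tau - \mathrm{cent}_{K,\tau}\|.$$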


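As $n_w = O(1)$, one needs to show that $\|V_\tau - \tilde{\mathscr{V}}_\tau\|$ goes to zero with high probability. From the Davis-Kahan theorem we get that

$$\|V_\tau - \tilde{\mathscr{V}}_\tau\| \lesssim \frac{\|L_\tau - \tilde{\mathscr{L}}_\tau\|}{\tilde\mu_{K,\tau} - \tilde\mu_{K+1,\tau}} \le \frac{\|L_\tau - \mathscr{L}_\tau\| + \|\mathscr{L}_\tau - \tilde{\mathscr{L}}_\tau\|}{\tilde\mu_{K,\tau} - \tilde\mu_{K+1,\tau}}. \tag{49}$$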


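The following lemma shows that for large $\tau$, the Laplacian matrix $\tilde{\mathscr{L}}_\tau$ is close to the Laplacian matrix $\mathscr{L}_\tau$ in spectral norm.

Lemma 12.

$$\|\mathscr{L}_\tau - \tilde{\mathscr{L}}_\tau\| \lesssim \frac{1}{1 + \tau/d^w_n}.$$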


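The lemma is proved in Appendix C.2. Consequently, from Lemma 12 and Theorem 4 one gets from (49) that

$$\|V_\tau - \tilde{\mathscr{V}}_\tau\| \lesssim \frac{1}{\tilde\mu_{K,\tau} - \tilde\mu_{K+1,\tau}}\left[\epsilon_{\tau,n} + \frac{1}{1 + \tau/d^w_n}\right]. \tag{50}$$

Further, from Lemma 11 one gets that

$$\tilde\mu_{K,\tau} - \tilde\mu_{K+1,\tau} = \frac{n_s(p^s_n - q_n)}{d^s_n + \tau} - \left[\frac{n_w(b_w + \tau/n)}{d^w_n + \tau} - \frac{n_w(b_{sw} + \tau/n)}{d^s_n + \tau}\right],$$

where $b_w = 1$. It is seen that $(\tilde\mu_{K,\tau} - \tilde\mu_{K+1,\tau})\tau$ converges to

$$n_s(p^s_n - q_n) - \left[n_w(b_w - b_{sw}) + (n_w/n)(d^s_n - d^w_n)\right],$$

which is $\gtrsim n_s(p^s_n - q_n)$ using $n_w = O(1)$.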

Consequently, the right side of (50) converges to

$$\frac{\sqrt{d^s_n\log n} + d^w_n}{n_s(p^s_n - q_n)}$$

for large $\tau$. Now, $d^s_n \asymp np^s_n$ and $d^w_n \asymp nb_{sw}$ (using $n_w = O(1)$). Consequently, the numerator in the above is $\lesssim \sqrt{np^s_n\log n}$ using Assumption (26). Consequently, under Assumption (24), one gets that $\|V_\tau - \tilde{\mathscr{V}}_\tau\|$ goes to zero with high probability.


C.2  Proof  of  Lemma  12 

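We bound the spectral norm of $\mathscr{L}_\tau - \tilde{\mathscr{L}}_\tau$. Here $\tilde{\mathscr{L}}_\tau$ is as in Appendix C.1. Take $\mathscr{L}_\tau = \mathscr{D}^{-1/2}(P + (\tau/n)J)\mathscr{D}^{-1/2}$ and $\tilde{\mathscr{L}}_\tau = \tilde{\mathscr{D}}^{-1/2}(\tilde P + (\tau/n)J)\tilde{\mathscr{D}}^{-1/2}$.

Notice that we ignore the subscript $\tau$ in both $\mathscr{D}$ and $\tilde{\mathscr{D}}$. Here, $\tilde P = Z\tilde BZ'$, with $\tilde B$ as in Subsection C.1.

As in the proof of Theorem 4, given in Appendix A.1, write

$$\mathscr{L}'_\tau = \tilde{\mathscr{D}}^{-1/2}(P + (\tau/n)J)\tilde{\mathscr{D}}^{-1/2}.$$

Then,

$$\|\mathscr{L}_\tau - \tilde{\mathscr{L}}_\tau\| \le \|\mathscr{L}_\tau - \mathscr{L}'_\tau\| + \|\mathscr{L}'_\tau - \tilde{\mathscr{L}}_\tau\|. \tag{51}$$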

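Consequently, we prove that $\mathscr{L}_\tau$ is close to $\tilde{\mathscr{L}}_\tau$ by showing that both terms in the right side of (51) are small. We first bound $\|\mathscr{L}_\tau - \mathscr{L}'_\tau\|$. As in (38), write

$$\|\mathscr{L}_\tau - \mathscr{L}'_\tau\| \le \|I - F\|\,(2 + \|I - F\|),$$

where as before $F - I = \left[I + (\mathscr{D} - \tilde{\mathscr{D}})\tilde{\mathscr{D}}^{-1}\right]^{1/2} - I$. Here $\mathscr{D} = \mathrm{diag}(d_{1,\tau}, \ldots, d_{n,\tau})$, and $\tilde{\mathscr{D}} = \mathrm{diag}(\tilde d_{1,\tau}, \ldots, \tilde d_{n,\tau})$. Now,

$$\|(\mathscr{D} - \tilde{\mathscr{D}})\tilde{\mathscr{D}}^{-1}\| \le \max_i\frac{|d_{i,\tau} - \tilde d_{i,\tau}|}{\tilde d_{i,\tau}} \le \frac{d^w_n}{d^w_n + \tau}.$$

Observe that we can assume that $\|(\mathscr{D} - \tilde{\mathscr{D}})\tilde{\mathscr{D}}^{-1}\| \le 3/4$ for large $\tau$, so that

$$\left(1 + \|(\mathscr{D} - \tilde{\mathscr{D}})\tilde{\mathscr{D}}^{-1}\|\right)^{1/2} - 1 \lesssim \|(\mathscr{D} - \tilde{\mathscr{D}})\tilde{\mathscr{D}}^{-1}\|,$$

and thus $\|\mathscr{L}_\tau - \mathscr{L}'_\tau\| \lesssim d^w_n/(d^w_n + \tau)$.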

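Next, we bound $\|\mathscr{L}'_\tau - \tilde{\mathscr{L}}_\tau\|$. Notice that

$$\|\mathscr{L}'_\tau - \tilde{\mathscr{L}}_\tau\| \le \|\tilde{\mathscr{D}}^{-1}\|\,\|P - \tilde P\|.$$

The quantity $\|\tilde{\mathscr{D}}^{-1}\| \le 1/(d^w_n + \tau)$. Further, note that $\|P - \tilde P\| \le d^w_n$, since $P - \tilde P$ is a matrix with all entries non-positive and hence its spectral norm is at most the maximum of its absolute row sums.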


C.3  Proof  of  Lemma  11 

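We investigate the eigenvalues of the $K+1$ community stochastic block model with block probability matrix

$$\tilde B = \begin{pmatrix} B_s & b_{sw}\mathbf{1} \\ b_{sw}\mathbf{1}' & b_w \end{pmatrix}.$$

In our case $b_w = 1$. Denote the corresponding population Laplacian by $\tilde{\mathscr{L}}$. Recall that from Subsection A.2 the non-zero eigenvalues of $\tilde{\mathscr{L}}$ are the same as those of

$$\tilde B_{eig} = (Z'RZ)^{1/2}\tilde B(Z'RZ)^{1/2}.$$

Now,

$$Z'RZ = \mathrm{diag}\left(\frac{n_s}{d^s_n}, \ldots, \frac{n_s}{d^s_n}, \frac{n_w}{d^w_n}\right).$$

Consequently,

$$\tilde B_{eig} = \begin{pmatrix} \dfrac{n_s}{d^s_n}B_s & \left(\dfrac{n_sn_w}{d^s_nd^w_n}\right)^{1/2}b_{sw}\mathbf{1} \\ \left(\dfrac{n_sn_w}{d^s_nd^w_n}\right)^{1/2}b_{sw}\mathbf{1}' & \dfrac{n_w}{d^w_n}b_w \end{pmatrix}.$$

One sees that

$$v_1 = \left(\sqrt{n_sd^s_n}, \ldots, \sqrt{n_sd^s_n}, \sqrt{n_wd^w_n}\right)'$$

is an eigenvector of $\tilde B_{eig}$ with eigenvalue 1. Next, consider a vector $v_2 = (v_{21}', 0)'$. Here $v_{21}$ is a $K \times 1$ dimensional vector that is orthogonal to the constant vector. We claim that $v_2$ so defined is also an eigenvector of $\tilde B_{eig}$. To see this notice that

$$\tilde B_{eig}v_2 = \begin{pmatrix} \dfrac{n_s}{d^s_n}B_sv_{21} \\ 0 \end{pmatrix}.$$

Here we use the fact that $\mathbf{1}'v_{21} = 0$ as $v_{21}$ is orthogonal to $\mathbf{1}$. Next, notice that

$$B_s = (p^s_n - q_n)I + q_n\mathbf{1}\mathbf{1}'.$$

Consequently,

$$B_sv_{21} = (p^s_n - q_n)v_{21}.$$

The above implies that $v_2$ is an eigenvector of $\tilde B_{eig}$ with eigenvalue $\lambda_1$ given by $n_s(p^s_n - q_n)/d^s_n$.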

Notice that from the above construction one can get $K-1$ orthogonal eigenvectors $v_k$, for $k = 2, \ldots, K$, such that the $v_k$'s are also orthogonal to $v_1$. Essentially, for $k \ge 2$, each $v_k = (v_{k1}', 0)'$, where $\mathbf{1}'v_{k1} = 0$. There are $K-1$ orthogonal choices of the $v_{k1}$'s.

Given that 1 and $\lambda_1$ are eigenvalues of $\tilde B_{eig}$, with the latter having multiplicity $K-1$, the remaining eigenvalue is given by

$$\lambda_2 = \mathrm{trace}(\tilde B_{eig}) - 1 - (K-1)\lambda_1 = \frac{n_sp^s_n}{d^s_n} + (K-1)\frac{n_sq_n}{d^s_n} + \frac{n_wb_w}{d^w_n} - 1 = \frac{n_wb_w}{d^w_n} - \frac{n_wb_{sw}}{d^s_n}.$$

The claim regarding the eigenvector corresponding to $\lambda_2$ follows from seeing that this should be the case since it is orthogonal to the eigenvectors $v_1, \ldots, v_K$ defined above.
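The eigenvalue structure derived above can be verified numerically. The sketch below is an illustration under assumed parameter values ($K = 3$ strong clusters of size $n_s = 40$, one weak cluster of size $n_w = 5$, and the probabilities shown); it includes the regularization by replacing $\tilde B$ with $\tilde B + (\tau/n)\mathbf{1}\mathbf{1}'$, so that the three distinct eigenvalues should be 1, the value in (47), and the value in (48).

```python
import numpy as np

K, n_s, n_w = 3, 40, 5                       # assumed cluster structure
p_s, q, b_sw, b_w = 0.4, 0.1, 0.05, 1.0      # assumed block probabilities
tau = 10.0                                   # assumed regularization parameter

sizes = np.array([n_s] * K + [n_w], dtype=float)
n = sizes.sum()

# (K+1) x (K+1) block probability matrix with strong block B_s and weak block b_w.
B = np.full((K + 1, K + 1), q)
B[:K, :K] += (p_s - q) * np.eye(K)
B[:K, K] = B[K, :K] = b_sw
B[K, K] = b_w

B_tau = B + tau / n
deg_tau = B_tau @ sizes                      # (d^s + tau, ..., d^s + tau, d^w + tau)
d_s, d_w = deg_tau[0] - tau, deg_tau[K] - tau

M_half = np.diag(np.sqrt(sizes / deg_tau))   # (Z'RZ)^{1/2}
B_eig = M_half @ B_tau @ M_half
eigs = np.sort(np.linalg.eigvalsh(B_eig))

mu_47 = n_s * (p_s - q) / (d_s + tau)
mu_48 = n_w * (b_w + tau / n) / (d_w + tau) - n_w * (b_sw + tau / n) / (d_s + tau)
expected = np.sort([1.0] + [mu_47] * (K - 1) + [mu_48])

v1 = np.sqrt(sizes * deg_tau)                # candidate eigenvector for eigenvalue 1
print(eigs, expected)
```

Setting `tau = 0.0` recovers the unregularized eigenvalues $1$, $\lambda_1$ and $\lambda_2$ of this subsection.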


D Extending DKest to allow for degree heterogeneity


Here, we describe how we extend DKest by substituting the estimate $\hat{\mathscr{L}}_\tau$ in (30) with one assuming that the data is drawn from a degree corrected stochastic block model (D-SBM). As mentioned before, the D-SBM is a more appropriate model for network datasets with extremely heterogeneous node degrees. The edge probability matrix takes the form

$$P = \Theta ZBZ'\Theta,$$

where $\Theta = \mathrm{diag}(\theta_1, \ldots, \theta_n)$ models the heterogeneity in the degrees.

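As before, let $\hat C_{1,\tau}, \ldots, \hat C_{K,\tau}$ be the cluster estimates obtained from running the RSC-$\tau$ Algorithm. Let $\hat Z$ be the corresponding $n \times K$ cluster membership matrix. Denote

$$\hat b_{k_1,k_2} = \sum_{i \in \hat C_{k_1,\tau}}\;\sum_{j \in \hat C_{k_2,\tau}} A_{ij}$$

and let $\hat B = ((\hat b_{k_1,k_2}))$ be the $K \times K$ matrix with entries $\hat b_{k_1,k_2}$.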

As in Karrer and Newman [14], we produce an estimate of the edge probability matrix $P$ given by

$$\hat P = \hat\Theta\hat Z\hat B\hat Z'\hat\Theta,$$

where $\hat\Theta = \mathrm{diag}(\hat\theta_1, \ldots, \hat\theta_n)$, with

$$\hat\theta_i = \frac{d_i}{\sum_{k'=1}^{K}\hat b_{k,k'}}$$

for $i \in \hat C_{k,\tau}$. Recall that $d_i$ is the degree of node $i$. It is seen that with the above definition of $\hat\Theta$ the sum of the $i$-th row of $\hat P$ is simply $d_i$.
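The construction of $\hat P$ can be sketched in a few lines. The code below is a minimal illustration, not the authors' implementation: the function name `dsbm_probability_estimate`, the random test graph, and the arbitrary cluster labels are assumptions, and the sketch presumes that every estimated cluster has at least one edge so that no division by zero occurs.

```python
import numpy as np

def dsbm_probability_estimate(A, labels, K):
    """Degree-corrected estimate P_hat = Theta Z B_hat Z' Theta from adjacency A and labels."""
    Z = np.eye(K)[labels]                    # n x K membership matrix
    d = A.sum(axis=1)                        # observed degrees d_i
    B_hat = Z.T @ A @ Z                      # b_hat[k1, k2] = sum of A_ij over the two clusters
    block_totals = B_hat.sum(axis=1)         # sum over k' of b_hat[k, k']
    theta = d / block_totals[labels]         # theta_hat_i for i in cluster k
    P_hat = theta[:, None] * (Z @ B_hat @ Z.T) * theta[None, :]
    return P_hat, theta, d

# Illustrative symmetric graph with no self-loops and arbitrary labels.
rng = np.random.default_rng(0)
n, K = 30, 3
upper = np.triu(rng.random((n, n)) < 0.3, k=1).astype(float)
A = upper + upper.T
labels = np.arange(n) % K

P_hat, theta, d = dsbm_probability_estimate(A, labels, K)
print(np.allclose(P_hat.sum(axis=1), d))
```

The row-sum property holds because the $\hat\theta_j$ within any one cluster sum to one, which follows from the cluster's total degree being equal to the corresponding row sum of $\hat B$.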

The estimate $\hat{\mathscr{L}}_\tau$ is taken as the population regularized Laplacian corresponding to the estimated edge probability matrix $\hat P$. In other words,

$$\hat{\mathscr{L}}_\tau = (D + \tau I)^{-1/2}\left(\hat P + \frac{\tau}{n}\mathbf{1}\mathbf{1}'\right)(D + \tau I)^{-1/2},$$

where recall that $D$ is the diagonal matrix of degrees.